research article modules for pipelined mixed radix...

Research ArticleModules for Pipelined Mixed Radix FFT Processors

Anatolij Sergiyenko and Anastasia Serhienko

Computer Science Department National Technical University of Ukraine Peremogy Avenue 37 Kiev 03056 Ukraine

Correspondence should be addressed to Anatolij Sergiyenko asergybigmirnet

Received 26 October 2015 Revised 2 January 2016 Accepted 5 January 2016

Academic Editor Michael Hubner

Copyright copy 2016 A Sergiyenko and A Serhienko This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

A set of soft IP cores for theWinograd 119903-point fast Fourier transform (FFT) is consideredThe cores are designed by the method ofspatial SDF mapping into the hardware which provides the minimized hardware volume at the cost of slowdown of the algorithmby 119903 times Their clock frequency is equal to the data sampling frequency The cores are intended for the high-speed pipelined FFTprocessors which are implemented in FPGA

1 Introduction

Fast Fourier transform (FFT) algorithm is widely used inmany signal processing and communication systems Due toits intensive computational requirements it occupies largearea and consumes high power if implemented in hardware

FFT uses divide and conquer approach to reduce thecomputations of the discrete Fourier transform (DFT) InCooley-Tukey radix-2 algorithm the 119873-point DFT is subdi-vided into two (1198732)-point DFTs and then (1198732)-point DFTis recursively divided into smaller DFTs until a two-pointDFT The last procedure named as radix-2 butterfly is justan addition and a subtraction of complex numbers

Higher radix algorithms such as radix-4 and radix-8can be employed to reduce the complex multiplications butthe butterfly structure becomes complex So a split radixalgorithm [1] is adopted to get the benefits of both radix-2and radix-4 algorithms

Prime factor algorithms use Good-Thomas mapping andChinese Remainder Theorem for decomposing the 119873-pointDFT into smaller DFTs which are the factors of 119873 and aremutually prime [2] With this mapping the twiddle factormultiplications are avoided at a cost of increased numberof additions and irregular structure A modification of theprime factor algorithm is Winograd fast Fourier transformalgorithm It is capable of achieving minimum complexmultiplications but the number of additions is increased

Pipelined FFT architectures are fast and high throughputarchitectures with parallelism and pipelining [3] Amongthem the single-path delay feedback architecture and multi-path delay commutator architecture are the most popular

In the first kind of architectures each of the pipelinestages contains twiddle factor multiplier 119903-point DFT unit(119903 = 2 4) and data buffer which stays in the feedback ofthe stage This buffer is filled before the DFT computationsTherefore such structure could not be fully loaded [4 5]

In the second kind of architectures the pipeline stagescontain data buffer 119903-point DFT unit and twiddle factormultiplier which are connected in sequence The data bufferis based on multipath delay commutator and provides setsof 119903 complex data which feed the DFT unit in parallel Sucharchitecture provides the maximum throughput at the cost ofthe high hardware volume [3 6] Besides it must implementthe uniform radix 119903 FFT because for the mixed radix FFT thedata buffers become too complex

Systolic array scheme has also been proposed for FFTcomputations [7 8] The 119903-point DFT in it is calculated as 119903separate sums of 119903 weighted samples It is attractive becauseof its regularity scalability locality of interconnections andsuitability for non-power-of-two transforms However suchprocessor requires substantial hardware volume For exam-ple the 16 times 16-point processor for 16-bit data contains 3982adaptive logic modules (ALMs) and 33 multipliers of theAltera Stratix III FPGA comparing to the usual 20-bit width

Hindawi Publishing CorporationInternational Journal of Reconfigurable ComputingVolume 2016 Article ID 3561317 7 pageshttpdxdoiorg10115520163561317

2 International Journal of Reconfigurable Computing

pipelined FFT processor which contains 4261 ALMs andonly 24 multipliers [8]

The implementation of the pipelined FFT architecture inmodern field programmable gate arrays (FPGAs) providesthe high-speed hardware solution with small energy con-sumption One FFT of 256 16-bit complex points dissipatesapproximately one microjoule in FPGA [6] The FFT pro-cessor for 119873 = 4096 which occupies 367 k ALMs and 60multipliers provides the speed up to 90 GFLOPS and theefficiency near 10 GFOPS per watt which is in many timeshigher than in CPU or GPU implementation [9] Besidesthis architecture can be accommodated in FPGA to the solvedproblem conditions by exchanging the throughput transformlength119873 or computation precision

The papers [10ndash12] describe the design and implementa-tion of radix-22 single-path delay feedback pipelined FFT

In most cases the power-of-two FFT processors areimplemented in FPGA In the paper [13] it was proven thatRadix-2 FFT method provides the least number of FPGAslices the Good-Thomasmethod is faster than Cooley-Tukeyand the Rader method had the lowest operating frequency ofthe pipelined processor in FPGA

The non-power-of-two transforms are widely used in theOFDMmodems In such transform the Winograd algorithmminimizes the number of multiplications in the DFT mod-ules but also adds a degree of complexity and significantlyincreases the total number of utilized adders in FPGA [14 15]

In [16] the parallel architecture of the DFT module hasbeen proposed for the computation of this algorithm Thisarchitecture is able to deal with a large amount of FFT sizesdecomposable in product factors that are 2 3 4 5 7 or 8

In [17] the pipelined processor design is proposed whichuses the Cooley-Tukey FFT algorithm for FFT computationonly in those cases where the factors of the number119873 are notrelatively prime

The DFT modules which are used in the examples ofthe pipelined FFT processors are designed by the one-to-one mapping of the respective small point FFT algorithmsAs a result they need the data feeding through 119903 input portsConsider the two stage pipeline the number119873 is factored tofactors119901 119903 119901 = 119903Then the buffer has 119903 FIFOs of the length ofmore than 119901 which are fed from 119901 inputs in the nonuniformorder Therefore to provide the proper input data order forthese stages the complex data buffers must be attached totheir ports

Consider the DFT modules which accept the 119903 inputdata sequentially for 119903 steps Then both data buffers andtwiddle factor multipliers are simplified substantially Thesemodules have the slowdown operation in 119903 times Besides thehardware volume of the DFTmodules can be decreased up to119903 times The disadvantage of this architecture is the decreaseof the FFT processor throughput up to 119903 times But in thissituationwe can provide the proper system throughput by theincrease of the FFT processors number which are configuredin FPGA

In this paperwe propose the design of a set of 119903-pointDFTunits which help to implement the pipelined FFT processorswhen the data flow is a single sample per a clock cycle

2 A Method of Pipelined Datapath Synthesis

By the high level synthesis the DSP algorithm is usuallydescribed by a signal flow graph or a synchronous data flow(SDF) In SDF the nodes-actors and edges represent theoperators and data transfers between them respectively Eachactor consumes and generates the same amount of data ineach SDF cycle [21 22]

Uniform SDF has the property that its graph is equal tothe graph of the pipelined datapath which implements thealgorithm with the period of 119871 = 1 clock cycle Then the SDFnodes are mapped into the operational resources like addersmultipliers the edges are mapped into the connections andthe delays in edges represent the pipelined registers Thisproperty is widely used by the synthesis of DSP modulesfor FPGA in many CAD tools like Matlab-Simulink SystemGenerator [23]

The synthesis of the pipelined datapath with the periodof 119871 gt 1 cycles is usually performed by the steps of resourceselection actor scheduling and resource assignment Thenthe datapath structure is found and the control unit issynthesized

It is worth mentioning that most of DFT algorithmsare acyclic ones The most popular scheduling methods forlimited resources and execution time consider the acyclicSDF These methods are list scheduling and force directedscheduling [24] The register allocation is effectively imple-mented by the Tseng heuristic and by the left edge schedulingThe use of the cyclic interval graph takes into accountthe cyclic nature of the SDF algorithm [25] The retimingmethods and the graph folding methods simplify the SDFmapping [26 27]

In [28 29] the method of the datapath synthesis isproposed which is based on SDF This method adapted tothe acyclic SDF is described below In this method SDF isrepresented in the three-dimensional space in the form ofa triple 119870

119866= (119870119863119860) where 119870 is the matrix of vectors-

nodes K119894 which mean the operators or actors 119863 is matrix

of vectors-edgesD119895 performing the links between operators

and 119860 is the incidence matrix of SDFIn the vector K

119894= (119896119894 119904119894 119905119894)119879 the coordinates 119896

119894 119904119894 and

119905119894correspond to the type of operator the processor unit

(PU) number and the clock cycle The SDF graph in suchrepresentation is called spatial SDF

Spatial SDF is split into the spatial configuration 119870119866119878

=

(119870119878 119863119878 119860) and event configuration 119870

119866119879= (119870

119879 119863119879 119860)

which correspond to the datapath structure and its scheduleBy the splitting process the vectors K

119894= (119896

119894 119904119894 119905119894)119879 are

decomposed into vectors K119878119894= (119896119894 119904119894)119879 corresponding to

the PU coordinates and vectors K119879119894

= 119905119894 which mean the

execution times of the relevant operators in PUK119878119894 Then the

temporal componentD119879119894= 119905119894of the vector D

119894is equal to the

delay of transfer or processing of the respective variableWe can assume that the matrix 119870 encodes some accept-

able structural solution since the matrix119863 is calculated by

119863 = 119870119860 (1)

The structural optimization consists in finding suchmatrix 119870 which minimizes a given quality criterion It is

International Journal of Reconfigurable Computing 3

possible to specify a matrix119863119874which provides theminimum

value of119879119862Then the vectorsK

119894are found from a relationship

119870 = 119863119874119860119874

minus1 (2)

where 119863119874is the matrix of vectors-nodes and 119860

119874is the inci-

dence matrix of the maximum spanning tree for SDF Whenlooking for the effective structural solution the followingrelations have to be considered Spatial SDF is valid if thematrix119870 has no two identical vectors that is

forallK119894K119895

(K119894= K119895 119894 = 119895) (3)

The schedule with the period of 119871 clock cycles is correctif the operators which are mapped into the same PU areperformed in different cycles that is


(119896119894= 119896119895 119904119894= 119904119895) 997904rArr 119905

119894equiv 119905119895mod 119871 (4)

This inequality provides the correct circular scheduleMoreover the next operator has to be executed no earlier thanthe previous one that is

forallD119895

(119905119894ge 0) (5)

The operators of the same type should bemapped into PUof the same type that is

K119894K119895isin 119870119901119902

(119896119894= 119896119895= 119901 119904

119894= 119904119895= 119902)

10038161003816100381610038161003816119870119901 119902

10038161003816100381610038161003816le 119871

(6)

where 119870119901119902

is a set of 119901-type vectors-operators which aremapped in the 119902th PU of 119901th type (119902 = 1 2 119902119901max)

Then the search for the effective schedule consists in thefollowing The vectors D

119894isin 119863119874are assigned the coordinate

119905119894= 1 that is the respective operators have the delays of

a single clock cycle The matrix 119870119879is found from (2) The

remaining elements of the matrix 119863119879are found from (1) If

inequality (5) is not satisfied for some of vectors then thecoordinate 119905

119894is increased for certain vectors D

119894isin 119863119874 and

the schedule search is repeated The rest of K119894coordinates

are found from conditions (3)ndash(6) In such wise the fastestschedule is built as each statement is executed in a singleclock cycle without unnecessary delays

The resulting spatial SDF can be described by the VHDLlanguage so the pipelined datapath description can be trans-lated into the gate level description of the FPGAconfigurationby the proper compiler-synthesizer [30]

During the structure synthesis the nodes are placed inthe space according to a set of rules providing the minimumhardware volume for the given number of clock cycles in thealgorithm period The resulting spatial SDF is described byVHDL language and is modelled and compiled using properCAD tools

The method is similar to the known method of the SDFfolding in 119871 times [31] However it is distinguished fromthe intuitive folding procedure in the formal approach anddirected optimization process In this method the stepsof resource selection operator scheduling and resource

allocation are implemented in a single step providing moreeffective optimization

Themethodwas built in the frameworkwhich is intendedfor the SDF graph input and its graphical editing Bothalgorithm and resulting structure are stored in XML filesTheframework can translate the XML description into the VHDLsynthesizable model which can be modelled and synthesizedby usual CAD tools provided by different companies Thepresent limitation consists in that the SDF graph is optimizedonly by hand using the relations definitions and theoremsmentioned above [32] The shown below examples weresynthesized by Xilinx ISE Ver 133

Themethod is successfully proven by the synthesis of a setof pipelined FFT processors IIR filters and other pipelineddatapaths for FPGA [33] A set of DFTmodules was designedusing it as well

3 Example of the DFT Module Synthesis

Consider a design example of a DFT module of 119903 = 3 pointsIt performs theWinograd DFT algorithm which is describedin [5]

119905 = 1199091+ 1199092

119901 = 1199092minus 1199091

1198980= 1199090+ 119905

1198981= (cos 2120587

3

minus 1) sdot 119905

1198982= 119895 sdot sin 2120587

3

sdot 119901

119904 = 1198980+ 1198981

1198830= 1198980

1198831= 119904 + 119898

2

1198832= 119904 minus 119898

2

(7)

where 1199090 1199091 1199092 is the input complex data set 119883

0 1198831 1198832

is the complex result set and 119895 = radicminus1 In this algorithmcos 21205873 = minus05 therefore 119898

1= minus15 sdot 119905 The algorithm has

twelve real additions and four real multiplicationsTominimize the number ofmultiply units (MPUs) and to

increase the clock frequency it is worth to use the applicationspecific MPUs [34] Then the coefficient 119896 = sin 21205873 =

0866025 = 011011101101101002 To minimize the addition

operations it is represented by digits 1 0 and minus1 that is119896 = 100

minus 10 00

minus 10 0minus 10minus 1 01

Then themultiplication 119896sdot119901 = sin 21205873sdot119901 is implementedas

119896 sdot 119901 = 119901 minus (119901 + 1199012minus4) 2minus3

+ ((1199012minus2minus 119901) 2

minus2minus 119901) 2

minus10

(8)

SDF of this algorithm is shown in Figure 1 The blackcircles in it represent the input-output nodes circle with a


j

t

p

j

s

minus

minusminus

minus

minus

minus

x0

x2

x1

m0

m2

minusm1

X0

X2

X1

≫1≫2

≫2 ≫10

≫3

≫4

Figure 1 SDF of the 3-point DFT

0 1 2 0 1 2 0 1 t mod 3

t

s

p

In

Out

j j

s

t1 2 3 4 5 6 70

X0 X2X1

minus

minus

minusminus

minus

minus

x0x2x1

m0

m2

≫1

≫2

≫2

≫10≫3

≫4

S1

S2

R2

S3

S4

S5

R1

S6

Figure 2 Spatial SDF of the 3-point DFT

cross does complex addition and symbols ldquo≫119897rdquomean the shiftright operation to 119897 bitsThe edge which is loaded by 119895 meansthe multiplication to 119895 which is performed as inversion ofthe image part of data and swapping the real and image partsof data Each node performs a delay to a single clock cycleTherefore this SDF makes the structure of a module whichcomputes DFT with the period of 119871 = 1 cycle

Due to the method described above SDF is representedin the three-dimensional space for 119871 = 3 as a spatial SDFwhich is illustrated by Figure 2 Comparing to SDF in Figure 1the spatial SDF is placed in the space with the coordinates ofresources 119904 and time 119905 The coordinates of a node in it meanthe number of PUs where it is triggered and the number ofthe clock cycles The operator type is coded by the character119877 as a register and character 119878 as an adder Below the axis 119900119905the axis with figures (119905mod 3) is placed to simplify the checkof the SDF correctness due to formula (4)

We can see that the spatial SDF codes both the algorithmschedule and the module structure where it is performed

MUX

In

R1

S1RG

S2RG

MUXS3

RG

MUX

MUXS4

RG

MUXR2

MUX

MUXS5

RG

MUX

S6RG

Out

j

minus

minus minus

minus

≫1

≫2

≫2

≫10≫3

≫4

plusmn

Figure 3 Structure of the DFT module

This SDF is formally translated into VHDL description in thesynthesis style as shown in Algorithm 1

Here the signals and ports are used which represent theoutputs of the respective operator nodes of SDF in Figure 2Note that the complex variables are substituted by a coupleof signed-type signals with the suffixes 119903 119894 in their namesrespectively The input impulse START synchronizes thephase generator which is described in the process operatorCNTRL It generates the phase number signal CYC with theperiod of 119871 = 3 clocks

The DFT calculations are performed in the processoperator CALC The CASE operator in it consists of threealternatives depending on the signal CYC In the 119894th alter-native the operators are placed which are performed in theclock cycle 119894 = 119905mod 3 due to the SDF in Figure 2

Each signal assignment in this process is mapped intothe respective pipeline register because the activation of itis implemented in the rising edge of the clock signal TheCASE operator alternatives are mapped into respective PUsof adders using the known resource sharing technique [35]The resulting DFT module structure is shown in Figure 3 Itis shown only as an illustration because its forming is notnecessary for the module design

As we can see in Figure 3 the adders with two inputmultiplexers are placed between two neighboring registers inthismodule A single digit of such unit ismapped into a single6-input CLB which is used inmodern FPGAsTherefore thismodule has the shortest critical path and maximized clockfrequency

4 Results of the DFT Module Synthesis

A set of DFT modules was designed using the method ofspaced SDF described above Each of them inputs a single


library IEEEuse IEEESTD LOGIC 1164all IEEESTD logic arithallentity DFT3 is

port(CLK in STD LOGICSTART in STD LOGICDRI in std logic vector(15 downto 0)DII in std logic vector(15 downto 0)DRO out std logic vector(17 downto 0)DIO out std logic vector(17 downto 0))

end DFT3architecture synt of DFT3 issignal S1rS2rS3rS5rS6rR1rR2r signed(17 downto 0)signal S1iS2iS3iS5iS6iR1iR2i signed(17 downto 0)signal S4rS4i signed(17 downto 0)signal CYC natural range 0 to 3

beginCNTRLprocess(CLK) begin

if rising edge(CLK) thenif START =1 thenCYClt=0

elseif CYC =2 thenCYC lt=0

elseCYC lt=CYC +1

end ifend if

end ifend processCALC process(CLK) begin

if rising edge(CLK) thencase CYC iswhen 0 =gt

S1rlt= signed(SXT(DRI S1r1015840length))S1ilt= signed(SXT(DII S1I1015840length))S2rlt=S1r minus S2rS2ilt=S1i minus S2iR2rlt= S1rR2ilt= S1iS4rlt= SHR(S3r 010) minus R1rS4ilt= SHR(S3i 010) minus R1iS5rlt=R1r minus SHR(S4r 011)S5ilt=R1i minus SHR(S4i 011)S6rlt= R2r minus S5iS6ilt= R2i + S5r

when 1 =gtS1rlt= S1r + signed(DRI)S1ilt= S1i + signed(DII)R2rlt= S2rR2ilt= S2iS3rlt= S1r minus signed(DRI)S3ilt= S1i minus signed(DII)S5rlt=S5r + SHR(S4r 1010)S5ilt=S5i + SHR(S4i 1010)S6rlt=R2rS6ilt=R2i

when others=gt

Algorithm 1 Continued

S1rlt= S1r + signed(DRI)S1ilt= S1i + signed(DII)S2rlt= S1r + SHR(S1r 001)S2ilt= S1i + SHR(S1i 001)S3rlt= SHR(S3r 010) minus S3rS3ilt= SHR(S3i 010) minus S3iS4rlt= S3r + SHR(S3r 0100)S4ilt= S3i + SHR(S3i 0100)R1rlt=S3rR1ilt=S3iS6rlt= R2r + S5iS6ilt= R2i minus S5r

end caseend ifend processDROlt= std logic vector(S6r)DIOlt= std logic vector(S6i)

end synt

Algorithm 1

Table 1 Results of configuration of the DFT modules

DFT length Hardware volumeLUTs + DSP48

Maximum clockfrequency MHz

3 245 6403 201 + 2 4334 215 5485 945 4358 1187 42415 = 3 sdot 5 2131 34616 3616 36864 = 8 sdot 8 1985 + 4 338128 = 8 sdot 16 5277 + 4 324

sample per clock cycle which provides the simple manner ofconnecting them in a system Besides the respective reorderbuffers based on the Xilinx SRL16 serial shift registers weresynthesized as well This helps to design the DFT modules ofthe higher order on the base of the Good-Thomas algorithmlike 119903 = 15 = 3 sdot 5 which have the minimized hardwarevolume

The results of configuring the modules in Xilinx Kintex-7 FPGAs for the 16-bit input data are shown in Table 1To compare the effect of the use of the application specificmultipliers the example of mapping the radix-3 DFTmodulewith two DSP48 multipliers is shown in the table as wellThe comparison of both DFT modules shows that the clockfrequency of the multiplier-free module can be increased upto 15 times

The analysis of Table 1 shows also that the clock frequencyof the module decreases with the increase of the transformlength This is explained by the fact that the ratio of delays inthe routes to the critical path delay in FPGAachieves 80 andhigherTherefore the place and route tool could not optimizeeffectively the large projects with a lot of interconnections


Table 2 64-point FFT processors configured in Xilinx FPGAs

FPGA series Hardware volumeCLBs + DSP48

Maximum clockfrequency MHz Reference

Spartan-3E758 + 8 170 [18]1063 + 12 116 [19]1984 + 4 127 Proposed

Virtex-5 695 + 24 384 [20]628 + 4 325 Proposed

The example of 64-point FFT processor is comparedwith similar processors in Table 2 Its advantages are smallhardware volume in the number of DSP48 units and highclock frequency by the nonrestrictive constraints

5 Conclusions

The implementation of the 119903-point DFT modules in FPGAprovides the design of the high-speed pipelined FFT proces-sors with optimized hardware volume It is proven that theDFT module with the slowdown operation in 119903 times has thehigh clock frequency and small hardware volume due to thepipelined calculations properties of the 6-input LUTs andapplication specific multipliers The designed DFT moduleswere used to build the pipelined FFT processors with119873 = 64128 and 256 which are deposited in the free IP core site [36]and can be downloaded for investigation and use

Our future work aims at design of the framework whichprovides automatic synthesis of pipelined FFT processorsbased on the DFT modules

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

References

[1] P Duhamel and H Hollmann ldquolsquoSplit radixrsquo FFT algorithmrdquoElectronics Letters vol 20 no 1 pp 14ndash16 1984

[2] H J Nussbaumer Fast Fourier Transform and ConvolutionAlgorithms Springer New York NY USA 1982

[3] L R Rabiner and B Gold Theory and Application of DigitalSignal Processing Prentice-Hall Upper Saddle River NJ USA1975

[4] E H Wold and A M Despain ldquoPipeline and parallel-pipelineFFT processors for VLSI implementationsrdquo IEEE Transactionson Computers vol 33 no 5 pp 414ndash426 1984

[5] S He and M Torkelson ldquoNew approach to pipeline FFTprocessorrdquo in Proceedings of the 10th International ParallelProcessing Symposium (IPPS rsquo96) pp 766ndash770 Honolulu April1996

[6] G Bi and E V Jones ldquoPipelined FFT processor for word-sequential datardquo IEEE Transactions on Acoustics Speech andSignal Processing vol 37 no 12 pp 1982ndash1985 1989

[7] P K Meher J C Patra and A P Vinod ldquoEfficient systolicdesigns for 1- and 2-dimensional DFT of general transform-lengths for high-speed wireless communication applicationsrdquoJournal of Signal Processing Systems vol 60 no 1 pp 1ndash14 2010

[8] J G Nash ldquoHigh-throughput programmable systolic array FFTarchitecture and FPGA implementationsrdquo in Proceedings ofthe International Conference on Computing Networking andCommunications (ICNC rsquo14) pp 878ndash884 IEEE HonoluluHawaii USA February 2014

[9] M Parker Optimizing Complex Floating Point Calculations onFPGAs Altera 2014 httpwwwembeddedcom

[10] A Chahardahcherik Y S Kavian O Strobel and R RejebldquoImplementing FFT algorithms on FPGArdquo International Jour-nal of Computer Science and Network Security vol 11 no 11 pp148ndash156 2011

[11] A Saeed M Elbably G Abdeel and M I Eladawy ldquoFPGAimplementation of Radix-22pipelined FFT processorrdquo in Pro-ceedings of the 3rdWSEAS International Symposium onWaveletsTheory and Applications in AppliedMathematics Signal Process-ing amp Modern Science (WAV rsquo09) pp 109ndash114 World Scientificand EngineeringAcademy and Society (WSEAS) January 2009

[12] K Harikrishna T R Rao and V A Labay ldquoFPGA implemen-tation of FFT algorithm for IEEE 80216e (Mobile WiMAX)rdquoInternational Journal of Computer Theory and Engineering vol3 no 2 pp 197ndash203 2011

[13] B Zhou Y Peng and D Hwang ldquoPipeline FFT architecturesoptimized for FPGAsrdquo International Journal of ReconfigurableComputing vol 2009 Article ID 219140 9 pages 2009

[14] V N Sudheer and V B Gopal ldquoFPGA implementation of64 point FFT processorrdquo International Journal of InnovativeTechnology and Exploring Engineering vol 1 no 4 pp 63ndash662012

[15] V Gautam K C Ray and P Haddow ldquoHardware efficientdesign of variable length FFT processorrdquo in Proceedings of the14th IEEE International Symposium onDesign andDiagnostics ofElectronic Circuits and Systems (DDECS rsquo11) pp 309ndash312 IEEECottbus Germany April 2011

[16] F Camarda J-C Prevolet and F Nouvel ldquoTowards a recon-figurable FFT application to digital communication systemsrdquoin Fourier TransformsmdashNew Analytical Approaches and FTIRStrategies G S Nikolic Ed chapter 10 pp 185ndash202 InTechRijeka Croatia 2011

[17] R Bhakthavatchalu M Viswanath P M Venugopal and RAnju ldquoProgrammable N-point FFT designrdquo Lecture Notes onSoftware Engineering vol 2 no 3 pp 247ndash250 2014

[18] Y Ouerhani M Jridi and A Alfalou ldquoImplementation tech-niques of high-order FFT into low-cost FPGArdquo in Proceedingsof the 54th IEEE International Midwest Symposium on Circuitsand Systems (MWSCAS rsquo11) pp 1ndash4 IEEE Seoul Republic ofKorea August 2011

[19] Xilinx Product Specification High Perfomance 64-Point Com-plex FFTIFFT V70 Xilinx San Jose Calif USA 2009httpwwwxilinxcomipcenter

[20] J Grajal M A Sanchez M Garrido and O Gustafs-son ldquoPipelined radix-2k feedforward FFT architecturesrdquo IEEETransactions on Very Large Scale Integration (VLSI) Systems vol21 no 1 pp 23ndash32 2013

[21] E Lee and DMesserschmitt ldquoSynchronous data flowrdquo Proceed-ings of the IEEE vol 75 no 9 pp 1235ndash1245 1987


[22] S Edwards L Lavagno E A Lee and A Sangiovanni-Vincentelli ldquoDesign of embedded systems formal modelsvalidation and synthesisrdquo Proceedings of the IEEE vol 85 no3 pp 366ndash389 1997

[23] System Generator for DSP User Guide UG640 (V 143) XilinxSan Jose Calif USA 2012 httpwwwxilinxcom

[24] P G Paulin and J P Knight ldquoForcemdashdirected sheduling for thebehavioral synthesis of ASICsrdquo IEEETransactions onComputer-Aided Design of Integrated Circuits and Systems vol 7 no 3 pp356ndash370 1988

[25] P Micheli U Lauther and P Duzy EdsThe Systhesis Approachto Digital System Design Kluwer Academic Publishers 1992

[26] K Keshab K K Parhi and Y Chen ldquoSignal flow graphs anddata flow graphsrdquo in Handbook of Signal Processing Systems SS Bhattacharyya E F Deprettere and R Leupers Eds pp 791ndash816 Springer New York NY USA 2010

[27] Y Robert and F Vivien Eds Introduction to Scheduling CRCPress Taylor ampFrancis Group 2010

[28] O Maslennikow and A Sergiyenko ldquoMapping DSP algorithmsinto FPGArdquo in Proceedings of the International Symposium onParallel Computing in Electrical Engineering (PARELEC rsquo06) pp208ndash213 IEEE Bialystok Poland September 2006

[29] AM Sergiyenko OMaslennikow andY Vinogradow ldquoTensorapproach to the application specific processor designrdquo in Pro-ceedings of the 10th International ConferencemdashThe Experience ofDesigning and Application of CAD Systems in Microelectronics(CADSM rsquo09) pp 146ndash149 IEEE Library Lviv Ukraine Febru-ary 2009

[30] AM SergiyenkoVHDL forDesign of Computers DiaSoft KievUkraine 2003 (Russian)

[31] R Hartley and K K Parhi Digit-Serial Computation vol 306Springer New York NY USA 1995

[32] A SergiyenkoV Simonenko YVinogradov andKGluchenkoldquoIP core synthesis in a cloudrdquo in Proceedings of the 3rd Interna-tional Conference ldquoHigh PerformanceComputingrdquo (HPC-UA rsquo13)pp 350ndash353 Kiev Ukraine October 2013 httphpc-uaorghpc-ua-13filesproceedings66pdf

[33] A Sergiyenko T Lesyk and O Maslennikow ldquoMapping DSPalgorithms into FPGArdquo in Proceedings of IEEE East-West Designamp Test Symposium (EWDTS rsquo08) pp 343ndash348 KNURE LvivUkraine October 2008

[34] S A Khan Digital Design of Signal Processing Systems APractical Approach John Wiley amp Sons New York NY USA2011

[35] Synopsys Synthesis and Simulation Design Guide Ver 21iXilinx San Jose Calif USA 1999

[36] A Sergiyenko and O Usenkov ldquoPipelined FFTIFFT 128points processorrdquo 2010 httpwwwopencoresorgprojectpipelined fft 128

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of



RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



pipelined FFT processor which contains 4261 ALMs andonly 24 multipliers [8]

The implementation of the pipelined FFT architecture inmodern field programmable gate arrays (FPGAs) providesthe high-speed hardware solution with small energy con-sumption One FFT of 256 16-bit complex points dissipatesapproximately one microjoule in FPGA [6] The FFT pro-cessor for 119873 = 4096 which occupies 367 k ALMs and 60multipliers provides the speed up to 90 GFLOPS and theefficiency near 10 GFOPS per watt which is in many timeshigher than in CPU or GPU implementation [9] Besidesthis architecture can be accommodated in FPGA to the solvedproblem conditions by exchanging the throughput transformlength119873 or computation precision

The papers [10ndash12] describe the design and implementa-tion of radix-22 single-path delay feedback pipelined FFT

In most cases the power-of-two FFT processors areimplemented in FPGA In the paper [13] it was proven thatRadix-2 FFT method provides the least number of FPGAslices the Good-Thomasmethod is faster than Cooley-Tukeyand the Rader method had the lowest operating frequency ofthe pipelined processor in FPGA

The non-power-of-two transforms are widely used in theOFDMmodems In such transform the Winograd algorithmminimizes the number of multiplications in the DFT mod-ules but also adds a degree of complexity and significantlyincreases the total number of utilized adders in FPGA [14 15]

In [16] the parallel architecture of the DFT module hasbeen proposed for the computation of this algorithm Thisarchitecture is able to deal with a large amount of FFT sizesdecomposable in product factors that are 2 3 4 5 7 or 8

In [17] the pipelined processor design is proposed whichuses the Cooley-Tukey FFT algorithm for FFT computationonly in those cases where the factors of the number119873 are notrelatively prime

The DFT modules which are used in the examples ofthe pipelined FFT processors are designed by the one-to-one mapping of the respective small point FFT algorithmsAs a result they need the data feeding through 119903 input portsConsider the two stage pipeline the number119873 is factored tofactors119901 119903 119901 = 119903Then the buffer has 119903 FIFOs of the length ofmore than 119901 which are fed from 119901 inputs in the nonuniformorder Therefore to provide the proper input data order forthese stages the complex data buffers must be attached totheir ports

Consider the DFT modules which accept the 119903 inputdata sequentially for 119903 steps Then both data buffers andtwiddle factor multipliers are simplified substantially Thesemodules have the slowdown operation in 119903 times Besides thehardware volume of the DFTmodules can be decreased up to119903 times The disadvantage of this architecture is the decreaseof the FFT processor throughput up to 119903 times But in thissituationwe can provide the proper system throughput by theincrease of the FFT processors number which are configuredin FPGA

In this paperwe propose the design of a set of 119903-pointDFTunits which help to implement the pipelined FFT processorswhen the data flow is a single sample per a clock cycle

2 A Method of Pipelined Datapath Synthesis

By the high level synthesis the DSP algorithm is usuallydescribed by a signal flow graph or a synchronous data flow(SDF) In SDF the nodes-actors and edges represent theoperators and data transfers between them respectively Eachactor consumes and generates the same amount of data ineach SDF cycle [21 22]

Uniform SDF has the property that its graph is equal tothe graph of the pipelined datapath which implements thealgorithm with the period of 119871 = 1 clock cycle Then the SDFnodes are mapped into the operational resources like addersmultipliers the edges are mapped into the connections andthe delays in edges represent the pipelined registers Thisproperty is widely used by the synthesis of DSP modulesfor FPGA in many CAD tools like Matlab-Simulink SystemGenerator [23]

The synthesis of the pipelined datapath with the periodof 119871 gt 1 cycles is usually performed by the steps of resourceselection actor scheduling and resource assignment Thenthe datapath structure is found and the control unit issynthesized

It is worth mentioning that most of DFT algorithmsare acyclic ones The most popular scheduling methods forlimited resources and execution time consider the acyclicSDF These methods are list scheduling and force directedscheduling [24] The register allocation is effectively imple-mented by the Tseng heuristic and by the left edge schedulingThe use of the cyclic interval graph takes into accountthe cyclic nature of the SDF algorithm [25] The retimingmethods and the graph folding methods simplify the SDFmapping [26 27]

In [28 29] the method of the datapath synthesis isproposed which is based on SDF This method adapted tothe acyclic SDF is described below In this method SDF isrepresented in the three-dimensional space in the form ofa triple 119870

119866= (119870119863119860) where 119870 is the matrix of vectors-

nodes K119894 which mean the operators or actors 119863 is matrix

of vectors-edgesD119895 performing the links between operators

and 119860 is the incidence matrix of SDFIn the vector K

119894= (119896119894 119904119894 119905119894)119879 the coordinates 119896

119894 119904119894 and

119905119894correspond to the type of operator the processor unit

(PU) number and the clock cycle The SDF graph in suchrepresentation is called spatial SDF

Spatial SDF is split into the spatial configuration 119870119866119878

=

(119870119878 119863119878 119860) and event configuration 119870

119866119879= (119870

119879 119863119879 119860)

which correspond to the datapath structure and its scheduleBy the splitting process the vectors K

119894= (119896

119894 119904119894 119905119894)119879 are

decomposed into vectors K119878119894= (119896119894 119904119894)119879 corresponding to

the PU coordinates and vectors K119879119894

= 119905119894 which mean the

execution times of the relevant operators in PUK119878119894 Then the

temporal componentD119879119894= 119905119894of the vector D

119894is equal to the

delay of transfer or processing of the respective variableWe can assume that the matrix 119870 encodes some accept-

able structural solution since the matrix119863 is calculated by

119863 = 119870119860 (1)

The structural optimization consists in finding suchmatrix 119870 which minimizes a given quality criterion It is





119870 = 119863119874119860119874

minus1 (2)


119874is the inci-



(K119894= K119895 119894 = 119895) (3)



(119896119894= 119896119895 119904119894= 119904119895) 997904rArr 119905

119894equiv 119905119895mod 119871 (4)


forallD119895

(119905119894ge 0) (5)


K119894K119895isin 119870119901119902

(119896119894= 119896119895= 119901 119904

119894= 119904119895= 119902)

10038161003816100381610038161003816119870119901 119902

10038161003816100381610038161003816le 119871

(6)

where 119870119901119902









119894isin 119863119874 and











119905 = 1199091+ 1199092

119901 = 1199092minus 1199091

1198980= 1199090+ 119905

1198981= (cos 2120587

3


1198982= 119895 sdot sin 2120587

3

sdot 119901

119904 = 1198980+ 1198981

1198830= 1198980

1198831= 119904 + 119898

2

1198832= 119904 minus 119898

2

(7)


0 1198831 1198832







minus 10 00




+ ((1199012minus2minus 119901) 2


minus10

(8)



j

t

p

j

s

minus

minusminus

minus

minus

minus

x0

x2

x1

m0

m2

minusm1

X0

X2

X1

≫1≫2

≫2 ≫10

≫3

≫4


0 1 2 0 1 2 0 1 t mod 3

t

s

p

In

Out

j j

s

t1 2 3 4 5 6 70

X0 X2X1

minus

minus

minusminus

minus

minus

x0x2x1

m0

m2

≫1

≫2

≫2

≫10≫3

≫4

S1

S2

R2

S3

S4

S5

R1

S6





MUX

In

R1

S1RG

S2RG

MUXS3

RG

MUX

MUXS4

RG

MUXR2

MUX

MUXS5

RG

MUX

S6RG

Out

j

minus

minus minus

minus

≫1

≫2

≫2

≫10≫3

≫4

plusmn
















elseCYC lt=CYC +1

end ifend if





when others=gt




end synt

Algorithm 1




3 245 6403 201 + 2 4334 215 5485 945 4358 1187 42415 = 3 sdot 5 2131 34616 3616 36864 = 8 sdot 8 1985 + 4 338128 = 8 sdot 16 5277 + 4 324









Virtex-5 695 + 24 384 [20]628 + 4 325 Proposed


5 Conclusions



Competing Interests


References








































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation













119870 = 119863119874119860119874

minus1 (2)


119874is the inci-



(K119894= K119895 119894 = 119895) (3)



(119896119894= 119896119895 119904119894= 119904119895) 997904rArr 119905

119894equiv 119905119895mod 119871 (4)


forallD119895

(119905119894ge 0) (5)


K119894K119895isin 119870119901119902

(119896119894= 119896119895= 119901 119904

119894= 119904119895= 119902)

10038161003816100381610038161003816119870119901 119902

10038161003816100381610038161003816le 119871

(6)

where 119870119901119902









119894isin 119863119874 and











119905 = 1199091+ 1199092

119901 = 1199092minus 1199091

1198980= 1199090+ 119905

1198981= (cos 2120587

3


1198982= 119895 sdot sin 2120587

3

sdot 119901

119904 = 1198980+ 1198981

1198830= 1198980

1198831= 119904 + 119898

2

1198832= 119904 minus 119898

2

(7)


0 1198831 1198832







minus 10 00




+ ((1199012minus2minus 119901) 2


minus10

(8)



j

t

p

j

s

minus

minusminus

minus

minus

minus

x0

x2

x1

m0

m2

minusm1

X0

X2

X1

≫1≫2

≫2 ≫10

≫3

≫4


0 1 2 0 1 2 0 1 t mod 3

t

s

p

In

Out

j j

s

t1 2 3 4 5 6 70

X0 X2X1

minus

minus

minusminus

minus

minus

x0x2x1

m0

m2

≫1

≫2

≫2

≫10≫3

≫4

S1

S2

R2

S3

S4

S5

R1

S6





MUX

In

R1

S1RG

S2RG

MUXS3

RG

MUX

MUXS4

RG

MUXR2

MUX

MUXS5

RG

MUX

S6RG

Out

j

minus

minus minus

minus

≫1

≫2

≫2

≫10≫3

≫4

plusmn
















elseCYC lt=CYC +1

end ifend if





when others=gt




end synt

Algorithm 1




3 245 6403 201 + 2 4334 215 5485 945 4358 1187 42415 = 3 sdot 5 2131 34616 3616 36864 = 8 sdot 8 1985 + 4 338128 = 8 sdot 16 5277 + 4 324









Virtex-5 695 + 24 384 [20]628 + 4 325 Proposed


5 Conclusions



Competing Interests


References








































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










j

t

p

j

s

minus

minusminus

minus

minus

minus

x0

x2

x1

m0

m2

minusm1

X0

X2

X1

≫1≫2

≫2 ≫10

≫3

≫4


0 1 2 0 1 2 0 1 t mod 3

t

s

p

In

Out

j j

s

t1 2 3 4 5 6 70

X0 X2X1

minus

minus

minusminus

minus

minus

x0x2x1

m0

m2

≫1

≫2

≫2

≫10≫3

≫4

S1

S2

R2

S3

S4

S5

R1

S6





MUX

In

R1

S1RG

S2RG

MUXS3

RG

MUX

MUXS4

RG

MUXR2

MUX

MUXS5

RG

MUX

S6RG

Out

j

minus

minus minus

minus

≫1

≫2

≫2

≫10≫3

≫4

plusmn
















elseCYC lt=CYC +1

end ifend if





when others=gt




end synt

Algorithm 1




3 245 6403 201 + 2 4334 215 5485 945 4358 1187 42415 = 3 sdot 5 2131 34616 3616 36864 = 8 sdot 8 1985 + 4 338128 = 8 sdot 16 5277 + 4 324









Virtex-5 695 + 24 384 [20]628 + 4 325 Proposed


5 Conclusions



Competing Interests


References








































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation
















elseCYC lt=CYC +1

end ifend if





when others=gt




end synt

Algorithm 1




3 245 6403 201 + 2 4334 215 5485 945 4358 1187 42415 = 3 sdot 5 2131 34616 3616 36864 = 8 sdot 8 1985 + 4 338128 = 8 sdot 16 5277 + 4 324









Virtex-5 695 + 24 384 [20]628 + 4 325 Proposed


5 Conclusions



Competing Interests


References








































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation














Virtex-5 695 + 24 384 [20]628 + 4 325 Proposed


5 Conclusions



Competing Interests


References








































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation



























RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









research article modules for pipelined mixed radix...

Documents