david hansen and james michelussi. introduction discrete fourier transform (dft) fast fourier...

David Hansen and James Michelussi

Introduction

Discrete Fourier Transform (DFT)
Fast Fourier Transform (FFT)
FFT Algorithm – Applying the Mathematics
Implementations of DFT and FFT
Hardware Benchmarks
Conclusion

DFT

Based on work introduced by Jean Baptiste Joseph Fourier in 1807

Allows a sampled (discrete) signal that is periodic to be transformed from the time domain to the frequency domain

Correlation between the time domain signal and N cosine and N sine waves

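\[ X(k) = \frac{1}{N} \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi nk / N} = \frac{1}{N} \sum_{n=0}^{N-1} x(n) \, W_N^{nk} \]

\[ e^{-j 2\pi nk / N} = \cos(2\pi nk / N) - j \sin(2\pi nk / N), \qquad W_N = e^{-j 2\pi / N} \]

X(k) = DFT frequency signal
N = number of sample points
x(n) = time domain signal
W_N = twiddle factor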

DFT (Walking Speed)

Why is this important? Where is this used?
Allows machines to calculate the frequency domain
Allows for the convolution of signals by just multiplying their spectra together in the frequency domain
Used in digital spectral analysis for speech, imaging and pattern recognition, as well as signal manipulation using filters

But the DFT requires N^2 multiplications!

FFT (Jet Speed)

J. W. Cooley and J. W. Tukey are given credit for bringing the FFT to the world in the 1960s

Simply an algorithm for calculating the DFT more efficiently
Takes advantage of symmetry and periodicity in the twiddle factors, and uses a divide-and-conquer method
Symmetry: W_N^{r+N/2} = -W_N^{r}
Periodicity: W_N^{r+N} = W_N^{r}
Requires only (N/2)log2(N) multiplications!
Faster computation times
More precise results due to less round-off error

FFT Algorithm

Several different types of FFT Algorithms (Radix-2, Radix-4, DIT & DIF)

Focus on Radix-2 using the Decimation in Time (DIT) method
Breaks down the DFT calculation into a number of 2-point DFTs
Each 2-point DFT uses an operation called the Butterfly
These groups are then re-combined with another group of two, and so on for log2(N) stages
Using the DIT method, the input time domain points must be reordered using bit reversal

Butterfly Operation

\[ W_N = e^{-j 2\pi / N} \]
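A radix-2 DIT butterfly takes two complex points and one twiddle factor and produces their sum and difference after a single complex multiply. The following is a minimal sketch in C, assuming the usual form a' = a + w*b, b' = a - w*b; the function name butterfly is hypothetical.

```c
#include <complex.h>

/* Radix-2 decimation-in-time butterfly, computed in place.
   a and b are the two input points, w is the twiddle factor. */
void butterfly(double complex *a, double complex *b, double complex w)
{
    double complex t = w * (*b);   /* the only complex multiply */
    *b = *a - t;                   /* lower output */
    *a = *a + t;                   /* upper output */
}
```

With w = 1 this reduces to a plain 2-point DFT: the outputs are the sum and the difference of the two inputs.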

Bit Reversal
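Bit reversal maps each sample index to the index whose binary digits appear in the opposite order, so for an 8-point FFT (3 bits) index 1 (001) moves to 4 (100) and index 3 (011) moves to 6 (110). The following is a minimal sketch in C; the name reverse_bits is hypothetical and is not the routine used in the benchmarked code.

```c
/* Reverse the lowest num_bits bits of index.
   Example: index = 1 (binary 001), num_bits = 3 gives 4 (binary 100). */
unsigned reverse_bits(unsigned index, unsigned num_bits)
{
    unsigned rev = 0;
    for (unsigned i = 0; i < num_bits; i++) {
        rev = (rev << 1) | (index & 1u);  /* shift in the next low bit */
        index >>= 1;
    }
    return rev;
}
```

For 8 points the resulting input order is 0, 4, 2, 6, 1, 5, 3, 7.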

8-Point Radix-2 FFT Example

David Hansen

Implementations of DFT and FFT

DFT Implementation

Nested for loop, (N/2)*N iterations: O(N^2)
63,027.41 cycles / sample (123 cycles per inner loop iteration)
Obvious inefficiencies: cos and sin math.h functions
Efficient assembly coding could reduce the inner loop to 3 cycles per iteration (1,536 cycles / sample)

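/* Direct DFT of a real input, bins 0 .. samples/2 (needs <math.h>). */
/* real_out and imag_out are assumed output arrays of samples/2 + 1 floats. */
for (int r = 0; r <= samples / 2; r++) {
    float re = 0.0f, im = 0.0f;
    float part = (float)r * -2.0f * PI / (float)samples;
    for (int k = 0; k < samples; k++) {
        float theta = part * (float)k;
        re += data_in[k] * cos(theta);
        im += data_in[k] * sin(theta);
    }
    /* Unscaled sums; divide by samples for a 1/N-normalized DFT. */
    real_out[r] = re;
    imag_out[r] = im;
}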

C++ FFT Implementation

void fft_float (unsigned NumSamples, float *RealIn, float *ImagIn, float *RealOut, float *ImagOut)
{
    for ( i=0; i < NumSamples; i++ )
    {
        // Iterate over the samples and perform the bit-reversal
        j = ReverseBits ( i, NumBits );
    }

    BlockEnd = 1;

    // Following loop iterates Log2(NumSamples)
    for ( BlockSize = 2; BlockSize <= NumSamples; BlockSize <<= 1 )
    {
        // Perform Angle Calculations (Using math.h sin/cos)

        // Following 2 loops iterate over NumSamples/2
        for ( i=0; i < NumSamples; i += BlockSize )
        {
            for ( j=i, n=0; n < BlockEnd; j++, n++ )
            {
                // Perform butterfly calculations
            }
        }

        BlockEnd = BlockSize;
    }
}

C++ FFT Implementation

Bit-reverse for loop: N iterations
Nested for loops
First outer loop: log2(N) iterations (made use of sin/cos math.h functions)
Second outer loop: N / BlockSize iterations
Inner loop: BlockSize/2 iterations
O(N + log2(N) * N/BlockSize * BlockSize/2) = O(N + N*log2(N))
193.84 cycles / sample
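The iteration counts can be checked by mirroring the loop nest of fft_float and counting how often the innermost body would run. The following is a minimal sketch in C; count_butterflies is a hypothetical helper that performs no arithmetic, and the sample count is assumed to be a power of two.

```c
/* Count executions of the butterfly body using the same loop bounds
   as the fft_float skeleton. Assumes num_samples is a power of two. */
unsigned long count_butterflies(unsigned long num_samples)
{
    unsigned long count = 0;
    unsigned long block_end = 1;

    for (unsigned long block_size = 2; block_size <= num_samples; block_size <<= 1) {
        for (unsigned long i = 0; i < num_samples; i += block_size)
            for (unsigned long n = 0; n < block_end; n++)
                count++;               /* one butterfly per inner iteration */
        block_end = block_size;
    }
    return count;
}
```

For 1024 samples this returns 5,120, which is (N/2)*log2(N): 512 butterflies in each of 10 stages.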

Assembly FFT Implementation

Bit-reverse address generation
Hide bit-reverse operation inside first and second FFT stages
Sin and cos values stored in a look-up table (LUT)
256 Kbyte LUT added to Data1
Needed to grow Data1 memory space using LDF file
Interleaved real and imaginary arrays
Quad reads load 2 complex points per cycle
Supports the real FFT for input signals with no imaginary component
40% algorithm-based savings

Assembly FFT Implementation

Special butterfly instruction
Can perform addition/subtraction in parallel in one compute block
Speeds up the inner-most loop
VLIW and SIMD operations
Performs simultaneous operations in both compute blocks
Loop unrolling and instruction scheduling keep the entire processor busy with instructions
11.35 cycles per sample

Assembly FFT Implementation

_BflyLoop:
q[j2+=4]=r27:26; k5=k5+k9; fr6=r30*r12; fr16=r6-r7;;
yr3:0=q[j0+=4]; k3=k5 and k4; fr15=r23*r4; fr24=r8+r18, fr26=r8-r18;;
xr3:0=q[j0+=4]; r5:4=l[k7+k3]; fr7=r31*r13; fr25=r9+r19, fr27=r9-r19;;
q[j1+=4]=r25:24; fr14=r30*r13; fr17=r14+r15;;
q[j2+=4]=r27:26; k5=k5+k9; fr6=r2*r4; fr18=r6-r7;;
yr11:8=q[j0+=4]; k3=k5 and k4; fr15=r31*r12; fr24=r20+r16, fr26=r20-r16;;
xr11:8=q[j0+=4]; r13:12=l[k7+k3]; fr7=r3*r5; fr25=r21+r17, fr27=r21-r17;;
q[j1+=4]=r25:24; fr14=r2*r5; fr19=r14+r15;;
q[j2+=4]=r27:26; k5=k5+k9; fr6=r10*r12; fr16=r6-r7;;
yr23:20=q[j0+=4]; k3=k5 and k4; fr15=r3*r4; fr24=r28+r18, fr26=r28-r18;;
xr23:20=q[j0+=4]; r5:4=l[k7+k3]; fr7=r11*r13; fr25=r29+r19, fr27=r29-r19;;
q[j1+=4]=r25:24; fr14=r10*r13; fr17=r14+r15;;
q[j2+=4]=r27:26; k5=k5+k9; fr6=r22*r4; fr18=r6-r7;;
yr31:28=q[j0+=4]; k3=k5 and k4; fr15=r11*r12; fr24=r0+r16, fr26=r0-r16;;
xr31:28=q[j0+=4]; r13:12=l[k7+k3]; fr7=r23*r5; fr25=r1+r17, fr27=r1-r17;;
.align_code 4;
if NLC0E, jump _BflyLoop;

DC FFT Test

Figures: FFT source array and FFT output magnitude

Audio FFT Test

Figures: FFT source array and FFT output magnitude

1024 Point DFT / FFT Comparison

Implementation                 Cycles Per Sample
DFT Implemented in C           63,027.41
DFT Implemented in Assembly    1,536
FFT Implemented in C           193.85
FFT Implemented in Assembly    11.35

1024 Point Radix-2 FFT Hardware Comparison

Processor Architecture      Cycles Per Sample   Processor Frequency   Execution Time
ADSP-21369 (SHARC)          8.98                400 MHz               22.99 µs
TigerSHARC (website)        9.16
TigerSHARC (our results)    11.35               600 MHz               19.37 µs
TMS320C6000™                14.125
TMS320DM644x™               7.59
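The execution times in the complete rows are consistent with multiplying cycles per sample by the 1024 samples and dividing by the clock frequency. The following is a minimal sketch of that arithmetic in C; fft_time_us is a hypothetical helper, and the relationship is an assumption inferred from the two rows that list all three values.

```c
/* Execution time in microseconds for one FFT, assuming
   time = cycles_per_sample * num_samples / clock frequency.
   With the clock in MHz, the result is directly in microseconds. */
double fft_time_us(double cycles_per_sample, unsigned num_samples, double clock_mhz)
{
    return cycles_per_sample * (double)num_samples / clock_mhz;
}
```

For example, fft_time_us(11.35, 1024, 600.0) gives about 19.37 and fft_time_us(8.98, 1024, 400.0) gives about 22.99.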


Conclusion

The FFT algorithm is very useful when computing the frequency domain on a DSP.

The FFT is much faster than a regular DFT algorithm.

The FFT is more precise because it creates fewer round-off errors.

The timed coding examples further support this claim and demonstrate how to code the algorithm.

The Radix-2 FFT isn't the fastest, but it uses a less complex addressing and twiddle factor routine.

In this case (unlike in school) F is better than D.
