Acceleration of Cooley-Tukey algorithm using Maxeler machine

Post on 30-Dec-2015

Acceleration of Cooley-Tukey algorithm using Maxeler machine

Author: Nemanja Trifunović Mentor: Professor dr. Veljko Milutinović

Introduction

● Cooley-Tukey algorithm

○ Fast Fourier Transform

○ Divide and conquer

○ Uses: digital signal processing, telecommunications, the analysis of sound signals, …

● Maxeler platform

○ Data flow (vs. control flow)

○ FPGA

Example of Fourier transformation.

(Source: https://en.wikipedia.org/wiki/File:Rectangular_function.svg; https://en.wikipedia.org/wiki/File:Sinc_function_(normalized).svg. The illustrations are published under a Creative Commons licence.)

Problem statement

Design and implementation of:

● The fastest possible system for calculating the Fast Fourier Transform on a Maxeler machine.

● A system that outperforms the existing solutions to this problem.

Problem statement

Benefits of calculating the Fast Fourier Transform with Maxeler machines

Benefits

● Higher speed of calculation.

● Lower power consumption.

● Lower space consumption.

Conditions

● Huge amounts of data.

Conditions and assumptions

● Maxeler machine used

○ Two Maxeler cards of type MAX3424A.

● In experiments with multiprocessor systems only one processor core was used.

Overview of existing solutions

● FFT algorithms: Prime-factor, Bruun’s, Rader’s, Winograd, Bluestein’s, …

● The time complexity: O(N log N).

● Performance comparison of publicly available implementations.

○ Matteo Frigo and Steven G. Johnson (from MIT)

Illustration of Matteo Frigo’s and Steven G. Johnson’s experiments. (Source: http://www.fftw.org/speed/Pentium4-3.60GHz-icc)

The proposed solution

● Parallelized radix 2 algorithm.

● Pipeline of depth O(log N), where N is the length of the input sequence.

● Latency is proportional to the depth of the pipeline.

● After the initial delay (latency), one result is produced every cycle.
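
The timing behaviour described above can be illustrated with a small cost model. This is only a sketch: the class and method names are hypothetical, and it assumes exactly one clock cycle per radix-2 stage, which a real floating-point pipeline on an FPGA will exceed by some constant factor.

```java
final class PipelineTiming {
    /** Number of radix-2 stages for an n-point transform, i.e. log2(n). */
    static int stages(int n) {
        if (n < 2 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("n must be a power of two >= 2");
        }
        return Integer.numberOfTrailingZeros(n);
    }

    /**
     * Cycles needed for `transforms` back-to-back inputs, assuming
     * (hypothetically) one clock cycle per stage: the pipeline fills once,
     * then delivers one result per cycle.
     */
    static long totalCycles(int n, long transforms) {
        if (transforms < 1) {
            throw new IllegalArgumentException("transforms must be >= 1");
        }
        return stages(n) + (transforms - 1);
    }

    public static void main(String[] args) {
        // The average cost per transform approaches one cycle as the batch grows.
        for (long count : new long[] {1, 100, 1_000_000}) {
            long cycles = totalCycles(64, count);
            System.out.printf("%d transforms: %d cycles, %.3f cycles per transform%n",
                    count, cycles, (double) cycles / count);
        }
    }
}
```

Under this model the latency matters only for short batches, which is why the speedup depends on the number of consecutive transforms.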

Formal analysis

The radix 2 Cooley-Tukey algorithm operates as follows:

1. The input sequence is divided into two subsequences of equal length: the even-indexed elements form the first, and the odd-indexed elements form the second.

2. The DFT of each subsequence is calculated in the same way, and the DFT of the whole sequence is then calculated from these two DFTs.
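
The two steps above can be sketched as a plain recursive routine on the CPU. This is a textbook formulation for illustration, not the Maxeler kernel; the class name and the split real/imaginary array layout are assumptions made here.

```java
final class RecursiveFft {
    /**
     * Radix-2 decimation-in-time FFT.
     * Returns {re, im} of the DFT; the input length must be a power of two.
     */
    static double[][] fft(double[] re, double[] im) {
        int n = re.length;
        if (n < 1 || (n & (n - 1)) != 0 || im.length != n) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        if (n == 1) {
            return new double[][] {{re[0]}, {im[0]}};
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        // Step 1: split into even-indexed and odd-indexed subsequences.
        for (int i = 0; i < half; i++) {
            evenRe[i] = re[2 * i];
            evenIm[i] = im[2 * i];
            oddRe[i] = re[2 * i + 1];
            oddIm[i] = im[2 * i + 1];
        }
        double[][] e = fft(evenRe, evenIm);
        double[][] o = fft(oddRe, oddIm);
        // Step 2: combine the two half-size DFTs.
        double[] outRe = new double[n], outIm = new double[n];
        for (int k = 0; k < half; k++) {
            double angle = -2.0 * Math.PI * k / n;
            double wRe = Math.cos(angle), wIm = Math.sin(angle);
            double tRe = wRe * o[0][k] - wIm * o[1][k];
            double tIm = wRe * o[1][k] + wIm * o[0][k];
            outRe[k] = e[0][k] + tRe;
            outIm[k] = e[1][k] + tIm;
            outRe[k + half] = e[0][k] - tRe;
            outIm[k + half] = e[1][k] - tIm;
        }
        return new double[][] {outRe, outIm};
    }

    public static void main(String[] args) {
        double[][] x = fft(new double[] {0, 1, 2, 3}, new double[4]);
        for (int k = 0; k < 4; k++) {
            System.out.printf("X[%d] = %.3f %+.3fi%n", k, x[0][k], x[1][k]);
        }
    }
}
```

The recursion depth is log2(N) and each level does O(N) work, which gives the O(N log N) complexity quoted earlier.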

Formal analysis

A detailed derivation of the following formula is given in the paper.

● The DFT of the even sequence is denoted by E_k,

● the DFT of the odd sequence is denoted by O_k, and

● e^(-2πik/N) is denoted by W_N^k.
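
The formula itself did not survive in this transcript. Assuming it is the standard radix-2 combination step, it reads as follows in the notation of the list above, with X_k denoting the DFT of the whole sequence (a symbol introduced here):

```latex
\begin{aligned}
X_k         &= E_k + W_N^{k}\, O_k, \\
X_{k+N/2}   &= E_k - W_N^{k}\, O_k,
\qquad k = 0, 1, \dots, \tfrac{N}{2}-1,
\qquad W_N^{k} = e^{-2\pi i k / N}.
\end{aligned}
```

The second line follows from the first because E_k and O_k are periodic with period N/2 and W_N^{k+N/2} = -W_N^{k}.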

Illustration of pipelined execution of the radix 2 algorithm.

Measurement and analysis of the performance of the proposed implementation

Types of performed experiments

● Calculation of the Fourier transform of 100, 1,000, 10,000, 1,000,000 and 10,000,000 consecutive input sequences of length 8, 16, 32 and 64 points.

● Maxeler implementation vs. reference CPU implementation

● Maxeler implementation vs. best publicly available implementations

Generated graphs:

● Maxeler vs best publicly available implementations of FFT algorithm.

● Run-times, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).

● Acceleration obtained using the Maxeler machine, compared to CPU execution, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).

The average execution time in seconds of publicly available algorithms for calculating the FFT on different architectures, for an input sequence of 8 elements.

Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of elements in the input sequence.

Computation time of consecutive fast Fourier transforms expressed in seconds depending on the number of consecutive calculations.

Acceleration of Maxeler implementation compared to CPU implementation depending on the number of consecutive calculations.

Analysis of scalability and bottlenecks of proposed solution

● Transfer of data to and from the Maxeler card

● Limited number of hardware resources on single Maxeler card

● Limited number of Maxeler cards

Analysis of implementation

Maxeler implementation of Cooley-Tukey algorithm consists of:

1. Rearrangement of the input sequence into bit-reversed order, and

2. Radix 2 algorithm.
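
The two parts listed above can be sketched in ordinary Java as an in-place iterative FFT. This is a CPU illustration under assumed names and an assumed split real/imaginary layout, not the MaxJ kernel; each pass of the outer loop corresponds to one level of the radix 2 algorithm.

```java
final class IterativeFft {
    /** Part 1: move element i to the index whose log2(n)-bit pattern is reversed. */
    static void bitReverse(double[] re, double[] im) {
        int n = re.length;
        int bits = Integer.numberOfTrailingZeros(n);
        for (int i = 0; i < n; i++) {
            int j = Integer.reverse(i) >>> (32 - bits);
            if (i < j) { // swap each pair only once
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
    }

    /** Part 2: radix-2 butterflies, level by level, overwriting re/im with the DFT. */
    static void fft(double[] re, double[] im) {
        int n = re.length;
        if (n < 1 || (n & (n - 1)) != 0 || im.length != n) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        bitReverse(re, im);
        int levels = Integer.numberOfTrailingZeros(n);
        for (int level = 1; level <= levels; level++) {
            int size = 1 << level;   // length of the sub-transforms at this level
            int half = size / 2;
            for (int start = 0; start < n; start += size) {
                for (int k = 0; k < half; k++) {
                    double angle = -2.0 * Math.PI * k / size;
                    double wRe = Math.cos(angle), wIm = Math.sin(angle);
                    int a = start + k, b = a + half;
                    double tRe = wRe * re[b] - wIm * im[b];
                    double tIm = wRe * im[b] + wIm * re[b];
                    re[b] = re[a] - tRe;
                    im[b] = im[a] - tIm;
                    re[a] += tRe;
                    im[a] += tIm;
                }
            }
        }
    }

    public static void main(String[] args) {
        double[] re = {0, 1, 2, 3}, im = new double[4];
        fft(re, im);
        for (int k = 0; k < 4; k++) {
            System.out.printf("X[%d] = %.3f %+.3fi%n", k, re[k], im[k]);
        }
    }
}
```

Reordering first means every level reads and writes neighbouring blocks, so the levels can be chained one after another, which suits a pipelined layout.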

Illustration of the kernel

Implementation details

● Two input and two output streams

● These streams are of type: arrayType

DFEType floatType = dfeFloat(8, 24);
DFEArrayType<DFEVar> arrayType = new DFEArrayType<DFEVar>(floatType, n);

● The factors W_N^k are not calculated on the Maxeler machine

● Parameters:

○ N

○ first_level

○ last_level
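
Since the W factors are not calculated on the Maxeler machine, they have to come from somewhere else. One plausible arrangement, which is an assumption and not something the slides state, is to precompute them on the host and pass them in as constants. A minimal host-side sketch, with a hypothetical class name and single precision chosen to match dfeFloat(8, 24):

```java
final class TwiddleTable {
    /**
     * Returns {re, im}, where re[k] + i*im[k] = exp(-2*pi*i*k/n)
     * for k = 0 .. n/2 - 1. Values are computed in double precision
     * and rounded to float.
     */
    static float[][] compute(int n) {
        if (n < 2 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("n must be a power of two >= 2");
        }
        int half = n / 2;
        float[] re = new float[half];
        float[] im = new float[half];
        for (int k = 0; k < half; k++) {
            double angle = -2.0 * Math.PI * k / n;
            re[k] = (float) Math.cos(angle);
            im[k] = (float) Math.sin(angle);
        }
        return new float[][] {re, im};
    }

    public static void main(String[] args) {
        float[][] w = compute(8);
        for (int k = 0; k < w[0].length; k++) {
            System.out.printf("W_8^%d = %.6f %+.6fi%n", k, w[0][k], w[1][k]);
        }
    }
}
```

Only N/2 values are needed for a transform of length N, because the factors used at the smaller levels are a subset of the table for the largest level.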

Conclusion

➔ It is shown that the proposed solution has the expected performance and works correctly.

➔ The performance of the proposed solution is better than the performance of any publicly available implementation of the Fast Fourier Transform.

➔ Achieving these speedups requires consecutive calculations of the Fast Fourier Transform.

Q/A

Thank you for your attention
