implementing the fft on gpus tips & trickscomputer-graphics.se/multicore/pdf/2013/12a fft on...

Post on 14-Sep-2020

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

LINKÖPING UNIVERSITY

Linköping, 2013

IMPLEMENTING THE FFT ON GPUs

TIPS & TRICKS

Department of Electrical

Engineering

Mario Garrido Gálvez mariog@isy.liu.se

2

MARIO GARRIDO

Associate Professor at ES, ISY

PhD in Electrical Engineering (Spain).

Research background:

Optimized implementation of signal processing algorithms.

Transforms (FFT, STFT,…), statistical operations

(regressions, median filter,…).

Data management (matrix transposition, interleavers,…).

Hardware designer (FPGAs, ASICs,…).

3

A STORY ABOUT GPUs

4

< 2011

Mainly FFTs on FPGAs.

Hundreds of papers in the topic since the 70’s.

Is not everything done???

OPTIMIZE

OPTIMIZE

OPTIMIZE

5

DFT / FFT

Discrete Fourier Transform / Fast Fourier Transform.

The most widely used algorithm in signal processing

- Audio and Image Processing. - 3G, 4G.

- Medical applications: EEG, ECG. - ADSL.

6

…,2011,…

FFT FPGA FFT FFT FPGA FPGA FFT FPGA

FPGA FFT FPGA FFT FFT FPGA FPGA FFT…

I should do something new!!

What about GPUs?…Shouldn’t it be the same…

7

FPGA vs GPU

FPGA GPU

Altera Cyclone II NVIDIA Fermi

8

…,2011,…

Started Master Thesis (Sreehari Ambuluri): FFTs on GPUs.

Read articles and a book on GPUs.

Asked Ingemar, Jens, Gabriel.

9

…,2012,…

Finish the Master Thesis.

The work is good.

Why not to improve it and

publish a paper?

Asked Ingemar, Jens and

Gabriel for collaboration.

10

…,2013,…

11

…,2013 BEST PAPER AWARD

12

NVIDIA

NVIDIA WE

13

LEVEL OF ABSTRACTION

The compiler decides We decide

COMPILATION

COMPILATION

DESCRIPTION

OPTIMIZATION

&

DESCRIPTION

ALGORITHM ALGORITHM =

IMPLEMENTATION IMPLEMENTATION <

DE

SIG

N P

RO

CE

SS

High level of abstraction Low level of abstraction

14

ABSTRACTION vs PERFORMANCE

z = 15 * x;

z = (x<<3)+(x<<2)+(x<<1) +x;

z = (x<<4)-x;

x z

15

x << 3

z

<< 2

<< 1

x << 4

z

LANGUAGE

DESCRIPTION HARDWARE

IMPLEMENTATION

15

UNDERSTANDING GPUs

1.- The performance is related to the computation time. The lower

the computation time, the higher the performance. Try to simplify

the operations in the algorithm.

2.- Transactions to global memory very expensive. Try to avoid or

minimize. Try to use shared memory.

3.- Threads must be synchronized if we want to share information

among them. Unless they are in the same warp. Try to reduce the

number of synchronization points.

4.- We have to calculate the index of the data processed by each

thread. Try to minimize the number of index calculations.

5.- Threads process data in parallel and the synchronization is not

possible until all the threads have finished the calculations. Balance

the load among thread.

FFT FLOW GRAPH (RADIX -2)

4 03

7

6

5

4

0

07

15

11

2

3

1

0

0

4

0

0

0

0

0

0

4

6 4

0

2

0 0

0

5

13

1

9

6

14

2

10

0

0

0

0

0

0

4

0

0

0 0

0

4

12

0

8

14

15

13

12

10

11

8

9

7

6

1

5

4

3

2

0

STAGE 1 STAGE 2 STAGE 3 STAGE 4

4

6

2

0

0

0

0

0

x[n]

n

X[k]

k

logrN

stages

butterflies

(radix-2)

rotations

Nj

e

2

N points

16

1. SIMPIFY THE ALGORITHM

17

USE RADIX-22

2. USE SHARED MEMORY

18

3. REDUCE SYNC. POINTS 2-word group 4-word group

USE WORD GROUPS

19

4. REDUCE INDEX CALCULATIONS

USE CONSTANT GEOMETRY

Conventional flow graph Constant Geometry

20

5. BALANCE LOAD AMONG THREADS

USE SCHEDULING

Unbalanced scheduling Balanced scheduling

21

CONCLUSIONS

22

Optimization:

- Depends on the details and the level of abstraction.

- Requires to understand in-depth what you are doing.

Teamwork makes a difference.

GPUs are fun.

23

top related