High Performance Discrete Fourier Transforms on Graphics Processors. Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, John Manferdelli. Microsoft Corporation.

Post on 31-Dec-2015

Page 1: High Performance Discrete Fourier Transforms  on  Graphics Processors

High Performance Discrete Fourier Transforms on Graphics Processors

Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, John Manferdelli

Microsoft Corporation

Page 2: High Performance Discrete Fourier Transforms  on  Graphics Processors

Discrete Fourier Transforms (DFTs)

• Given an input signal of N values f(n), project it onto a basis of complex exponentials
– Often computed using Fast Fourier Transforms (FFTs) for efficiency
• Fundamental primitive for signal processing
– Convolutions, cryptography, computational fluid dynamics, large polynomial multiplications, image and audio processing, etc.
• A popular HPC benchmark
– HPC Challenge benchmark
– NAS parallel benchmarks

$$F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-2\pi i k n / N}$$
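As a reference point, the DFT definition on this slide can be evaluated directly in O(N^2) time. The sketch below assumes the usual forward-transform convention with a negative exponent; the function name `naive_dft` is illustrative and not from the talk.

```python
import cmath


def naive_dft(f):
    """Direct O(N^2) evaluation of F(k) = sum_n f(n) * exp(-2*pi*i*k*n/N)."""
    n = len(f)
    return [
        sum(f[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

An FFT computes the same values in O(N log N) operations, so a routine like this is mainly useful for checking faster implementations on small inputs.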

Page 3: High Performance Discrete Fourier Transforms  on  Graphics Processors

DFT: Challenges

• HPC Challenge 2008
– DFT on Cray XT3: 0.9 TFLOPS
– HPL: 17 TFLOPS
• Complex memory access patterns
– Limited data reuse
– For a balanced system, if the compute-to-memory ratio doubles, the cache size needs to be squared for the system to be balanced again [Kung86]
• Architectural issues
– Cache associativity, memory banks
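A back-of-the-envelope way to see the [Kung86] scaling for FFTs follows; it is a sketch, not the slide's own derivation. It assumes a local memory of S words that holds S points and can run about log2 S butterfly stages on them before going back to main memory. The symbols S and S' are introduced here for illustration.

```latex
\frac{\text{compute}}{\text{memory traffic}} \;\propto\; \frac{S \log_2 S}{S} = \log_2 S,
\qquad
\log_2 S' = 2 \log_2 S \;\Longrightarrow\; S' = S^{2}.
```

Under that assumption, doubling the ratio of arithmetic to memory traffic requires doubling log2 S, which squares the local memory size.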

Page 4: High Performance Discrete Fourier Transforms  on  Graphics Processors

GPU: Commodity Processor

[Figure: devices that ship with GPUs: cell phones, consoles, PSP, desktops]

Page 5: High Performance Discrete Fourier Transforms  on  Graphics Processors

Parallelism in GPUs

[Figure: GPU memory (DRAM banks) connected to an array of TPCs; each TPC contains groups of SPs, and each group has its own local memory]

Page 6: High Performance Discrete Fourier Transforms  on  Graphics Processors

Programmability

[Figure: GPU memory (DRAM banks) connected to an array of TPCs, each containing SPs with local memory; a thread execution manager schedules thread blocks, each with its own registers and local memory]

High-level programming abstractions: Microsoft DirectX11, OpenCL, NVIDIA CUDA, AMD CAL, etc.

Page 7: High Performance Discrete Fourier Transforms  on  Graphics Processors

Discrete Fourier Transforms

• Objectives:
– Efficiency: Achieve high performance exploiting the memory hierarchy and high parallelism
– Accuracy: Design algorithms that achieve comparable numerical accuracy with CPU libraries
– Scalability: Demonstrate scalable performance based on underlying hardware capabilities
• Focus on computing single-precision DFTs that fit in GPU memory
– Demonstrate DFT performance of 100-300 GFLOPS per GPU for typical large sizes
– Concepts applicable to double-precision algorithms

Page 8: High Performance Discrete Fourier Transforms  on  Graphics Processors

FFT Overview

Page 9: High Performance Discrete Fourier Transforms  on  Graphics Processors

FFT Overview

[Figure labels: FFT along columns; FFT along rows; Transpose]

Page 10: High Performance Discrete Fourier Transforms  on  Graphics Processors

Registers (16K)

Global memory (1 GB)

Shared memory (16 KB per multiprocessor)

Significant literature on FFT algorithms. Detailed survey in [Van Loan 92]

Page 11: High Performance Discrete Fourier Transforms  on  Graphics Processors

DFTs on GPUs: Challenges

• Coalescing issues
– Access contiguous blocks of data to achieve high DRAM bandwidth
• Bank conflicts
– Affine access patterns can map to same banks
• Transpose overheads
– Reduce memory access overheads
• Occupancy
– Require several threads to hide memory latency

Page 12: High Performance Discrete Fourier Transforms  on  Graphics Processors

Outline

• FFT Algorithms
– Global Memory
– Shared Memory
– Hierarchical Memory
– Other FFT algorithms
• Experimental Results
• Conclusions and Future Work


Page 14: High Performance Discrete Fourier Transforms  on  Graphics Processors

Overview

• Global Memory Algorithm
– Large N
– Uses high memory bandwidth of GPUs
• Shared Memory Algorithm
– Small N
– Data re-use in shared memory of GPU MPs
• Hierarchical Algorithm
– Intermediate sizes
– Combines data transposes with shared memory algorithm

Page 15: High Performance Discrete Fourier Transforms  on  Graphics Processors

Global memory algorithm

• Proceeds in log_R N steps (radix = R)
• Decompose N into blocks B and threads T such that B*T = N/R
• Each thread:
– reads R values from global memory
– multiplies by twiddle factors
– performs an R-point FFT
– writes R values back to global memory
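The per-thread steps above can be sketched on the CPU as a radix-R Stockham-style FFT, where the loop over `j` stands in for the N/R GPU threads. This is a minimal illustration assuming N is a power of R. The indexing scheme and all function names (`expand`, `small_dft`, `fft_iteration`, `stockham_fft`) are assumptions for the sketch, not code from the talk.

```python
import cmath


def expand(j, ns, radix):
    """Insert a radix-sized gap after every run of ns consecutive indices."""
    return (j // ns) * ns * radix + (j % ns)


def small_dft(v):
    """Direct R-point DFT, standing in for a hand-unrolled R-point FFT."""
    r = len(v)
    return [
        sum(v[n] * cmath.exp(-2j * cmath.pi * k * n / r) for n in range(r))
        for k in range(r)
    ]


def fft_iteration(j, n, radix, ns, src, dst):
    """Work done by 'thread' j in one step: read, twiddle, R-point FFT, write."""
    angle = -2 * cmath.pi * (j % ns) / (ns * radix)
    # Read R values with stride N/R and multiply by twiddle factors.
    v = [src[j + r * (n // radix)] * cmath.exp(1j * r * angle) for r in range(radix)]
    v = small_dft(v)
    # Write R values with stride ns (ns = R^step).
    base = expand(j, ns, radix)
    for r in range(radix):
        dst[base + r * ns] = v[r]


def stockham_fft(x, radix):
    """Out-of-place FFT in log_R(N) steps; len(x) must be a power of radix."""
    n = len(x)
    src, dst = list(x), [0j] * n
    ns = 1
    while ns < n:
        for j in range(n // radix):  # on a GPU these are N/R parallel threads
            fft_iteration(j, n, radix, ns, src, dst)
        src, dst = dst, src  # ping-pong buffers between steps
        ns *= radix
    return src
```

For example, `stockham_fft(data, 4)` on 64 points runs three steps with 16 "threads" each. The output arrives in natural order, so no separate bit-reversal pass is needed.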

Page 16: High Performance Discrete Fourier Transforms  on  Graphics Processors

Global Memory Algorithm

[Figure: threads 0 to 3 with R = 4 at step j = 1; the access pattern is labeled with N/R and R^j]

If N/R > coalesce width (CW), no coalescing issues during reads

If R^j > CW, no coalescing issues during writes

If R^j <= CW, write to shared memory, rearrange data across threads, write to global memory with coalescing
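The coalescing conditions above can be explored with a small model. The sketch assumes a Stockham-style indexing in which thread j reads element `j + r*N/R` and writes element `expand(j, R^step, R) + r*R^step`, and it counts how many aligned CW-sized segments the first CW threads touch. The indexing and the one-segment-per-group model are simplifying assumptions, not details taken from the slides.

```python
def expand(j, ns, radix):
    """Insert a radix-sized gap after every run of ns consecutive indices."""
    return (j // ns) * ns * radix + (j % ns)


def segments_touched(addresses, cw):
    """Number of distinct aligned cw-sized memory segments hit by the addresses."""
    return len({a // cw for a in addresses})


def read_segments(n, radix, cw, r=0):
    """Segments touched when the first cw threads read their r-th input value."""
    return segments_touched([j + r * (n // radix) for j in range(cw)], cw)


def write_segments(radix, ns, cw, r=0):
    """Segments touched when the first cw threads write their r-th output value."""
    return segments_touched([expand(j, ns, radix) + r * ns for j in range(cw)], cw)
```

With N = 4096, R = 4, and CW = 16, reads always fall in a single segment, while writes touch four segments when the write stride is 4 and one segment when it is 64. The first case is the one where staging through shared memory helps.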

Page 17: High Performance Discrete Fourier Transforms  on  Graphics Processors

Shared memory algorithm

• Applied when FFT is computed on data in shared memory of an MP
• Each block has N*M/R threads
– M is number of FFTs performed together in a block
– Each MP performs M FFTs at a time
• Similar to global memory algorithm
– Use Stockham formulation to reduce compute overheads

Page 18: High Performance Discrete Fourier Transforms  on  Graphics Processors

Shared Memory Algorithm

[Figure: threads 0 to 3 with R = 4 at step j = 1; the access pattern is labeled with N/R and R^j]

If N/R > numbanks, no bank conflicts during reads

If R^j > numbanks, no bank conflicts during writes

Page 19: High Performance Discrete Fourier Transforms  on  Graphics Processors

Shared Memory Algorithm

[Figure: threads 0 to 7 with R = 4 at step j = 1; the access pattern is labeled with N/R and R^j. Banks accessed by threads 0 to 7: 0, 4, 8, 12, 0, 4, 8, 12]

If R^j <= numbanks, add padding to avoid bank conflicts
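The bank pattern in the figure and the effect of padding can be reproduced with a few lines of arithmetic. The sketch assumes 16 banks with one word per bank and uses one common padding scheme (skip one word after every 16). The slides do not say which padding layout the library uses, so `padded` is illustrative only.

```python
NUM_BANKS = 16  # assumed bank count; adjust for the target hardware


def banks(addresses, num_banks=NUM_BANKS):
    """Bank index hit by each shared-memory word address."""
    return [a % num_banks for a in addresses]


def max_conflict(addresses, num_banks=NUM_BANKS):
    """Largest number of addresses that land in the same bank."""
    hit = banks(addresses, num_banks)
    return max(hit.count(b) for b in set(hit))


def padded(address, num_banks=NUM_BANKS):
    """Hypothetical padding: insert one unused word after every num_banks words."""
    return address + address // num_banks
```

With a stride of 4, `banks([4 * t for t in range(8)])` gives `[0, 4, 8, 12, 0, 4, 8, 12]`, matching the figure. Sixteen threads at that stride produce a 4-way conflict, and applying `padded` to the same addresses spreads them over 16 distinct banks.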

Page 20: High Performance Discrete Fourier Transforms  on  Graphics Processors

Hierarchical FFT

• Decompose FFT into smaller-sized FFTs
– Evaluate efficiently using shared memory algorithm
– Combine transposes with FFT computation
– Achieve memory coalescing

Page 21: High Performance Discrete Fourier Transforms  on  Graphics Processors

Hierarchical FFT

[Figure: multiprocessors (SPs with shared memory) above DRAM; the input is drawn as an array of height H and width W = N/H, with a strip of CW columns highlighted]

Perform CW FFTs of size H in shared memory

Page 22: High Performance Discrete Fourier Transforms  on  Graphics Processors

Hierarchical FFT

[Figure: array of height H and width W = N/H]

Transpose

Perform H FFTs of size W recursively

In-place algorithm

Final set of transposes can also be combined with FFT computation
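The decomposition on these two slides can be sketched as a recursive four-step FFT: view the N inputs as an H by W array, transform the columns, apply twiddle factors, transform the rows recursively, and transpose. This is a minimal CPU illustration assuming N is a power of two and H is a power of two. A direct DFT stands in for the shared-memory kernel, and the function names are hypothetical.

```python
import cmath


def dft(v):
    """Direct DFT, standing in for the shared-memory FFT on a small block."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def hierarchical_fft(x, h):
    """FFT of x, viewed as a row-major array with h rows and len(x)/h columns."""
    n = len(x)
    if n <= h:
        return dft(x)  # small enough for a single 'shared memory' FFT
    assert n % h == 0, "sketch assumes h divides the current length"
    w = n // h
    # Step 1: FFTs of size h down each of the w columns.
    cols = [dft([x[r * w + c] for r in range(h)]) for c in range(w)]
    rows = []
    for k1 in range(h):
        # Step 2: twiddle factors exp(-2*pi*i*c*k1/n) on row k1.
        row = [cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n) for c in range(w)]
        # Step 3: FFT of size w along the row, computed recursively.
        rows.append(hierarchical_fft(row, h))
    # Step 4: transpose so that output index k = k1 + h * k2.
    return [rows[k1][k2] for k2 in range(w) for k1 in range(h)]
```

This sketch keeps the transposes as explicit index shuffles for clarity and is not in-place. The slides note that the library instead folds the transposes into the FFT kernels.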

Page 23: High Performance Discrete Fourier Transforms  on  Graphics Processors

Other FFTs

• Non-Power-of-Two sizes
– Mixed Radix
  • Using powers of 2, 3, 5, etc.
– Bluestein's FFT
  • For large prime factors
• Multi-dimensional FFTs
– Perform FFTs independently along each dimension
• Real FFTs
– Exploit symmetry to improve the performance
– Transformed into a complex FFT problem
• DCTs
– Computed using a transformation to complex FFT problem
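For the Bluestein entry in the list above, the idea is to rewrite a DFT of arbitrary length N (for example, a prime) as a convolution, which can then be evaluated with power-of-two FFTs. The sketch below is a textbook formulation written for illustration and is not the library's implementation. It includes a simple recursive radix-2 FFT and a direct DFT for checking.

```python
import cmath


def fft_pow2(x, sign=-1):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_pow2(x[0::2], sign), fft_pow2(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def bluestein_dft(x):
    """DFT of any length via the identity n*k = (n^2 + k^2 - (k - n)^2) / 2."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # convolution length, rounded up to a power of two
        m *= 2
    # Chirp exp(-i*pi*k^2/n); k^2 is reduced mod 2n to keep the angle small.
    chirp = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    a = [x[k] * chirp[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m
    for k in range(n):
        b[k] = chirp[k].conjugate()
        if k:
            b[m - k] = chirp[k].conjugate()  # negative lags wrap around
    product = [p * q for p, q in zip(fft_pow2(a), fft_pow2(b))]
    conv = fft_pow2(product, sign=+1)  # inverse transform, scaled by m
    return [chirp[k] * conv[k] / m for k in range(n)]


def naive_dft(x):
    """Direct O(N^2) DFT, used only to check the result."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

The cost is three power-of-two FFTs of length at least 2N - 1, so prime sizes run several times slower than nearby powers of two but keep O(N log N) scaling.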

Page 24: High Performance Discrete Fourier Transforms  on  Graphics Processors

Microsoft DFT Library

Key features supported in our GPU DFT library

Dimension   1D, 2D, 3D
Algorithms  Shared memory, Global memory, Hierarchical
Data type   Single, Real, Complex
Runtime     Auto-tuning, Virtualization
Size        Large prime factors; 2^a, 3^b, etc.; Mixed-radix; Multiple transforms

Page 25: High Performance Discrete Fourier Transforms  on  Graphics Processors

Outline

• FFT Algorithms
– Global Memory
– Shared Memory
– Hierarchical Memory
– Other FFT algorithms
• Experimental Results
• Conclusions and Future Work

Page 26: High Performance Discrete Fourier Transforms  on  Graphics Processors

Experimental Methodology

• Hardware
– Intel QX9650 3.0 GHz quad-core processor
  • Two dual core dies
  • Each pair of cores shares 6 MB L2 cache
– NVIDIA GTX280 GPU
  • Driver version 177.41

Name       Multi-procs  Shader Clock (MHz)  Peak Perf. (GFlops)  Memory Capacity (MB)  Peak Bandwidth (GiB/s)
GTX280     30           1300                930                  1024                  140
8800 GTX   16           1300                520                  768                   80
8800 GTS   16           1625                620                  512                   60

Page 27: High Performance Discrete Fourier Transforms  on  Graphics Processors

Experimental Methodology

• Libraries
– Our FFT library written in CUDA
  • Tested on various GPUs
– NVIDIA's CUFFT library (v. 1.1)
  • Results for GTX280 only
– DX9FFT library [Lloyd et al. 2007]
  • Results for GTX280 only
– Intel's MKL (v. 10.0.2)
  • Run on CPU with 4 threads

Page 28: High Performance Discrete Fourier Transforms  on  Graphics Processors

Experimental Methodology

• Notation
– N: Size of the FFT
– M: Number of FFTs
• Performance
– GFlops: M * 5N lg(N) / time
– Minimum time over multiple runs
  • Warm caches on CPU
• Accuracy
– Perform forward transform and inverse
– Compare result to original input
– Root mean square error (RMSE) / 2
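The two metrics above can be written down as follows. This is a sketch under stated assumptions: the flop count uses the conventional 5 N lg N estimate from the slide, `time` is taken in seconds, and the RMSE is computed over the N complex samples of a forward-plus-inverse round trip and then halved. The slides do not spell out the exact normalization, and the function names are illustrative.

```python
import cmath
import math


def gflops(n, m, seconds):
    """Nominal GFlops for m FFTs of size n: m * 5 * n * lg(n) / time / 1e9."""
    return m * 5 * n * math.log2(n) / seconds / 1e9


def dft(x, sign):
    """Direct unnormalized DFT; sign=-1 is forward, sign=+1 is inverse."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def roundtrip_rmse(x):
    """Forward transform, inverse transform, then RMSE against the input, halved."""
    n = len(x)
    back = [v / n for v in dft(dft(x, -1), +1)]
    mse = sum(abs(a - b) ** 2 for a, b in zip(back, x)) / n
    return math.sqrt(mse) / 2
```

For example, one 1024-point FFT that takes one microsecond scores `gflops(1024, 1, 1e-6)`, about 51.2 GFlops. To evaluate a real library, its forward and inverse transforms would replace the two `dft` calls.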

Page 29: High Performance Discrete Fourier Transforms  on  Graphics Processors

1D Single FFT

[Figure: GFlops (0 to 100) vs. log2 N (4 to 24), M = 1. Series: Ours GTX280, Ours 8800GTX, Ours 8800GTS, CUFFT, MKL]

Page 30: High Performance Discrete Fourier Transforms  on  Graphics Processors

1D Multi-FFT

[Figure: GFlops (0 to 300) vs. log2 N (1 to 23), M = 2^23 / N. Series: Ours GTX280*, Ours GTX280, MKL, Ours 8800GTS, CUFFT, DX9FFT. Annotation: entire FFT in shared memory kernel. *Driver 177.11]

Page 31: High Performance Discrete Fourier Transforms  on  Graphics Processors

1D Multi-FFT

[Figure: relative runtime (log scale, 1 to 64) vs. log2 N (1 to 23), M = 2^23 / N. Series: MKL, CUFFT, Ours 8800GTS, Ours GTX280. Annotated ratios: 40x, 20x, 5x]

Page 32: High Performance Discrete Fourier Transforms  on  Graphics Processors

1D Mixed Radix

[Figure: GFlops (0 to 120) vs. N, for N = 2^a 3^b 5^c and M = 2^23 / N. Series: Ours, CUFFT, MKL]

Page 33: High Performance Discrete Fourier Transforms  on  Graphics Processors

1D Primes

[Figure: GFlops (0 to 16) vs. N, M = 2^20 / N. Series: Ours, MKL, CUFFT]

Page 34: High Performance Discrete Fourier Transforms  on  Graphics Processors

1D Large Primes

[Figure: GFlops (0 to 20) vs. log2 N (4 to 21), M = 2^22 / N. Series: Ours, MKL]

Page 35: High Performance Discrete Fourier Transforms  on  Graphics Processors

RMSE Error (N = 2^a)

[Figure: error (1.0E-8 to 1.0E-6) vs. log2 N (1 to 23). Series: CUFFT, Ours, Accurate, MKL]

Page 36: High Performance Discrete Fourier Transforms  on  Graphics Processors

RMSE Error (Mixed radix)

[Figure: error (1.0E-7 to 1.0E-5) vs. log2 N (6 to 22). Series: CUFFT, Ours, MKL]

Page 37: High Performance Discrete Fourier Transforms  on  Graphics Processors

RMSE Error (primes)

[Figure: error (log10 scale, 1.0E-7 to 1.0E-2) vs. N. Series: CUFFT, Ours, MKL]

Page 38: High Performance Discrete Fourier Transforms  on  Graphics Processors

Limitations

• Current implementation
– Works only on data in GPU memory
– No multi-GPU support
– No support for double precision
• Hardware Issues
– Large data sizes needed to fully utilize GPU
– Slow data transfer between GPU and system memory
– High accuracy twiddle factors are slow
  • Use a table (especially for double precision)
– Need to virtualize block index
  • Fixed in Microsoft DirectX11

Page 39: High Performance Discrete Fourier Transforms  on  Graphics Processors

Outline

• FFT Algorithms
– Global Memory
– Shared Memory
– Hierarchical Memory
– Other FFT algorithms
• Experimental Results
• Conclusions and Future Work

Page 40: High Performance Discrete Fourier Transforms  on  Graphics Processors

Conclusions

• Several algorithms for performing FFTs on GPUs
– Handle different sizes efficiently
– Library chooses appropriate algorithms for a given size and hardware configuration
– Optimized for memory performance
  • Combined transposes with FFT computation
– Address numerical accuracy issues
• High performance
– Up to 300 GFLOPS on current high-end GPUs
– Significantly faster than existing GPU-based libraries and CPU-based libraries for typical large sizes

Page 41: High Performance Discrete Fourier Transforms  on  Graphics Processors

Future Work

• More sophisticated auto-tuning
• Add additional functionality:
– Double precision
– Multi-GPU support
– Out-of-core support for very large FFTs
• Port to DirectX11 using Compute Shaders

Page 42: High Performance Discrete Fourier Transforms  on  Graphics Processors

Future of GPGPU

• GPUs are becoming more general purpose
– Fewer limitations. Microsoft DirectX11 API:
  • IEEE floating point support and optional double support
  • Integer instruction support
  • More programmable stages, etc.
– Significant advance in performance
– Higher level programming languages
– Uniform abstraction layer over different hardware vendors

Page 43: High Performance Discrete Fourier Transforms  on  Graphics Processors

Future of GPGPU

• Widespread adoption of GPUs in commercial applications
– Image and media processing, signal processing, finance, etc.
• High performance computing
– Can benefit from data-parallel programming
– Many opportunities
– Microsoft GPU Station at Booth number 1309

Page 44: High Performance Discrete Fourier Transforms  on  Graphics Processors

Acknowledgments

• Microsoft: Chas Boyd, Craig Mundie, Ken Oien
• NVIDIA: Henry Moreton, Sumit Gupta, and David
• Peter-Pike Sloan
• Vasily Volkov