CUDA Toolkit 3.2 Math Library Performance, January 2011

DESCRIPTION

A comparison of the Math Libraries in CUDA 3.2

TRANSCRIPT

Page 1: CUDA 3.2 Math Libraries Performance

CUDA Toolkit 3.2

Math Library Performance

January, 2011

Page 2: CUDA 3.2 Math Libraries Performance


CUDA Toolkit 3.2 Libraries

NVIDIA GPU-accelerated math libraries:

cuFFT – Fast Fourier Transforms Library

cuBLAS – Complete BLAS library

cuSPARSE – Sparse Matrix Library

cuRAND – Random Number Generation (RNG) Library

For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216

Page 3: CUDA 3.2 Math Libraries Performance

cuFFT

Multi-dimensional Fast Fourier Transforms

New in CUDA 3.2:

Higher performance for 1D, 2D, and 3D transforms with dimensions that are powers of 2, 3, 5, or 7

Higher performance and accuracy for 1D transform sizes that contain large prime factors
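The benchmark charts that follow report throughput in GFLOPS. For FFTs this is conventionally derived from the 5·N·log2(N) operation-count estimate for a complex transform of size N; a minimal sketch of that convention (whether NVIDIA's charts used exactly this formula is an assumption here):

```python
import math

def fft_gflops(n, seconds):
    """Estimate FFT throughput in GFLOPS using the conventional
    5*N*log2(N) operation count for a complex 1D transform of size N."""
    flops = 5.0 * n * math.log2(n)
    return flops / seconds / 1e9

# Example: a 2^20-point transform completing in 1 ms corresponds to
# 5 * 2^20 * 20 ~= 104.9 million flops, i.e. roughly 104.9 GFLOPS.
print(round(fft_gflops(2**20, 1e-3), 1))
```

This is why the charts can compare transforms of different sizes on one GFLOPS axis: the metric normalizes time by the nominal work of each transform.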

Page 4: CUDA 3.2 Math Libraries Performance

cuFFT 3.2 up to 8.8x Faster than MKL

[Chart: Single-Precision 1D Radix-2, GFLOPS (0 to 400) vs. N for transform size 2^N, N = 6 to 23; series: cuFFT, MKL, FFTW]

[Chart: Double-Precision 1D Radix-2, GFLOPS (0 to 120) vs. N for transform size 2^N, N = 6 to 23; series: cuFFT, MKL, FFTW]

* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU

1D transforms are used in audio processing and as a foundation for 2D and 3D FFTs

Performance may vary based on OS version and motherboard configuration

Page 5: CUDA 3.2 Math Libraries Performance

cuFFT 1D Radix-3 up to 18x Faster than MKL

[Chart: Single-Precision, GFLOPS (0 to 240) vs. N for transform size 3^N, N = 2 to 16; series: cuFFT (ECC off), cuFFT (ECC on), MKL]

[Chart: Double-Precision, GFLOPS (0 to 80) vs. N for transform size 3^N, N = 2 to 16; series: cuFFT (ECC off), cuFFT (ECC on), MKL]

18x speedup for single precision, 15x for double precision

Similar acceleration for radix-5 and radix-7

* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz

Performance may vary based on OS version and motherboard configuration

Page 6: CUDA 3.2 Math Libraries Performance

cuFFT 3.2 up to 100x Faster than cuFFT 3.1

[Chart: Speedup of cuFFT 3.2 vs. cuFFT 3.1 for C2C single-precision 1D transforms, log scale (1x to 100x), transform sizes 0 to 16384, on Tesla C2050 (Fermi)]

Performance may vary based on OS version and motherboard configuration

Page 7: CUDA 3.2 Math Libraries Performance

cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation

Supports all 152 routines for single, double, complex, and double-complex precision

New in CUDA 3.2:

7x faster GEMM (matrix multiply) on Fermi GPUs

Higher performance on SGEMM and DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
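All of the BLAS GEMM variants compute C ← α·op(A)·op(B) + β·C, where op() is either the identity or a transpose; the NN, TT, TN, and NT cases name which of A and B is transposed. A minimal pure-Python reference of that semantics, for illustration only (this is not the cuBLAS API):

```python
def gemm(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    """Reference GEMM: C <- alpha*op(A)*op(B) + beta*C, where op() is
    the identity or a transpose. Matrices are lists of row lists."""
    T = lambda M: [list(col) for col in zip(*M)]  # transpose helper
    A = T(A) if trans_a else A
    B = T(B) if trans_b else B
    m, k, n = len(A), len(A[0]), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]

# "NT" case: A times B-transposed; B is the identity, so the result is A
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(gemm(1.0, A, B, 0.0, C, trans_b=True))  # [[1.0, 2.0], [3.0, 4.0]]
```

A GPU BLAS replaces the triple loop above with tiled, parallel kernels, but the input/output contract is the same, which is what makes cuBLAS a drop-in backend for BLAS-based code.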

Page 8: CUDA 3.2 Math Libraries Performance

Up to 8x Speedup for all GEMM Types

[Chart: GEMM performance on 4K-by-4K matrices, GFLOPS; cuBLAS 3.2 vs. MKL: SGEMM 636 vs. 78, CGEMM 775 vs. 80, DGEMM 301 vs. 39, ZGEMM 295 vs. 40]

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz

Performance may vary based on OS version and motherboard configuration

Page 9: CUDA 3.2 Math Libraries Performance

cuBLAS 3.2: Consistently 7x Faster DGEMM Performance

[Chart: DGEMM GFLOPS (0 to 350) vs. dimension (m = n = k); series: cuBLAS 3.2, cuBLAS 3.1, MKL 10.2.3 with 4 threads]

cuBLAS 3.2 is consistently about 7x faster than MKL; cuBLAS 3.1 showed large performance variance across matrix sizes

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz

Performance may vary based on OS version and motherboard configuration

Page 10: CUDA 3.2 Math Libraries Performance

cuSPARSE

New library for sparse linear algebra

Conversion routines for dense, COO, CSR, and CSC formats

Optimized sparse matrix-vector multiplication

[Figure: sparse matrix-vector multiplication y = α·A·x + β·y, illustrated with a 4x4 sparse matrix A and dense vectors x and y]
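As a sketch of what the CSR format and the optimized multiply involve: CSR stores the nonzero values, their column indices, and per-row offsets, and SpMV computes y ← α·A·x + β·y by walking each row's nonzeros. A minimal pure-Python illustration (not the cuSPARSE API; names are ours):

```python
def csr_spmv(alpha, vals, col_idx, row_ptr, x, beta, y):
    """y <- alpha*A*x + beta*y for a CSR matrix:
    vals[row_ptr[i]:row_ptr[i+1]] holds row i's nonzero values,
    col_idx[p] gives the column of nonzero p."""
    out = []
    for i in range(len(row_ptr) - 1):
        acc = sum(vals[p] * x[col_idx[p]]
                  for p in range(row_ptr[i], row_ptr[i + 1]))
        out.append(alpha * acc + beta * y[i])
    return out

# 3x3 example:  [[1, 0, 2],
#                [0, 3, 0],
#                [4, 0, 5]]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(1.0, vals, col_idx, row_ptr, [1.0, 1.0, 1.0], 0.0,
               [0.0, 0.0, 0.0]))  # [3.0, 3.0, 9.0]
```

Because only nonzeros are stored and touched, work and memory traffic scale with the number of nonzeros rather than the full matrix size, which is what a GPU SpMV kernel parallelizes across rows.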

Page 11: CUDA 3.2 Math Libraries Performance

cuSPARSE up to 32x Faster than MKL

[Chart: speedup of cuSPARSE vs. MKL for 1 sparse matrix x 6 dense vectors, log scale (1x to 64x); series: single, double, complex, double-complex]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz

Performance may vary based on OS version and motherboard configuration

Page 12: CUDA 3.2 Math Libraries Performance

Performance of 1 Sparse Matrix x 6 Dense Vectors

[Chart: GFLOPS (0 to 160) across test cases, roughly in order of increasing sparseness; series: single, double, complex, double-complex]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on

Performance may vary based on OS version and motherboard configuration

Page 13: CUDA 3.2 Math Libraries Performance


cuRAND

Library for generating random numbers

Features supported in CUDA 3.2

XORWOW pseudo-random generator

Sobol’ quasi-random number generators

Host API for generating random numbers in bulk

Inline implementation allows use inside GPU functions/kernels

Single- and double-precision, uniform and normal distributions
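XORWOW is Marsaglia's xorshift generator augmented with a Weyl-sequence counter. A pure-Python sketch of the published recurrence follows; cuRAND's device implementation, seeding, and per-thread state handling differ, so this is illustrative only:

```python
MASK32 = 0xFFFFFFFF

def make_xorwow(x=123456789, y=362436069, z=521288629,
                w=88675123, v=5783321, d=6615241):
    """XORWOW (Marsaglia, 2003): a 160-bit xorshift state plus a Weyl
    counter d stepped by 362437. Returns a function yielding 32-bit ints."""
    state = [x, y, z, w, v, d]

    def next_u32():
        x, y, z, w, v, d = state
        t = (x ^ (x >> 2)) & MASK32       # xorshift feedback term
        x, y, z, w = y, z, w, v           # rotate the state words
        v = ((v ^ (v << 4)) ^ (t ^ (t << 1))) & MASK32
        d = (d + 362437) & MASK32         # Weyl sequence increment
        state[:] = [x, y, z, w, v, d]
        return (d + v) & MASK32

    return next_u32

rng = make_xorwow()
samples = [rng() for _ in range(4)]  # deterministic 32-bit outputs
# a uniform float in [0, 1) is then sample / 2**32
```

The Weyl counter breaks up the linear structure of plain xorshift cheaply, which is part of why XORWOW maps well to GPU throughput.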

Page 14: CUDA 3.2 Math Libraries Performance

cuRAND Performance

[Chart: Sobol' Quasi-RNG (1 dimension), GigaSamples/second (0 to 18) for uniform int, uniform float, uniform double, normal float, normal double]

[Chart: XORWOW Pseudo-RNG, GigaSamples/second (0 to 18) for the same distributions]

* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on

Performance may vary based on OS version and motherboard configuration