Page 1: Krishnan Suresh (“Suresh”) suresh@engr.wisc.edu …outreach.sbel.wisc.edu/Workshops/GPUworkshop/2012/... · CUBLAS Library Usage No additional downloads needed

Popular CUDA Packages

Krishnan Suresh (“Suresh”)

suresh@engr.wisc.edu

Associate Professor

Mechanical Engineering

Take-Home Message

• Don’t reinvent the wheel!

• Minimize custom Kernels

Conjugate Gradient

• Solve Ax = b via CG (Matlab)

GPU algorithms:

• Dot-product: Use CUBLAS

• Ax: Use CUSPARSE

• αx + y: Use CUBLAS
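
The three building blocks named above are enough to write the whole solver. The following is a minimal CPU sketch in plain C++, not the CUBLAS/CUSPARSE code from the workshop: `dot`, `axpy`, `conjugate_gradient` and `tridiag_matvec` are hypothetical names, and the tridiagonal test operator is only an illustration. On the GPU, each helper would be swapped for the corresponding library call.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Dot product x^T y (the role the slide assigns to CUBLAS).
double dot(const Vec& x, const Vec& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// y <- alpha*x + y (also assigned to CUBLAS on the slide).
void axpy(double alpha, const Vec& x, Vec& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Solve A x = b for symmetric positive-definite A, given only a
// matrix-vector product (the role the slide assigns to CUSPARSE).
Vec conjugate_gradient(const std::function<Vec(const Vec&)>& matvec,
                       const Vec& b, double tol = 1e-10, int max_iter = 1000) {
    Vec x(b.size(), 0.0);  // initial guess x0 = 0
    Vec r = b;             // residual r = b - A*x0 = b
    Vec p = r;             // first search direction
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        const Vec Ap = matvec(p);
        const double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);    // x <- x + alpha*p
        axpy(-alpha, Ap, r);  // r <- r - alpha*A*p
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}

// Hypothetical test operator: tridiagonal A with 2 on the diagonal and -1 beside it.
Vec tridiag_matvec(const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        if (i > 0) y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
    return y;
}
```

Calling `conjugate_gradient(tridiag_matvec, b)` returns the approximate solution; every iteration costs one matrix-vector product, two dot products and a few vector updates, which is why the deck maps each of them to a library routine.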

CUDA Libraries & Packages

1. CUBLAS: Dense Linear Algebra

2. Thrust: Parallel sort, …

3. CuSparse: Sparse Linear Algebra Package

4. Jacket: Matlab Wrapper

5. CULA: Dense and sparse linear algebra

6. MAGMA: Multicore linear algebra

7. CUFFT: Fast Fourier Transform

8. …

CUBLAS

• CUDA implementation of BLAS (Basic Linear Algebra Subprograms)

– Vector, vector (Level-1)

– Matrix, vector (Level-2)

– Matrix, matrix (Level-3)

• Precisions

– Single: real & complex

– Double: real & complex (not all functions)

• No need to write kernels, manage shared memory, etc.

CUBLAS Library Usage

• No additional downloads needed

– cublas.lib (in CUDA SDK)

– Add cublas.lib to linker

– #include <cublas.h>

CUBLAS Code Structure

1. Initialize CUBLAS: cublasInit()

2. Create CPU memory and data

3. Create GPU memory: cublasAlloc(…)

4. Copy from CPU to GPU: cublasSetVector(…)

5. Operate on GPU: cublasSgemm(…)

6. Check for CUBLAS error: cublasGetError()

7. Copy from GPU to CPU: cublasGetVector(…)

8. Verify results

9. Free GPU memory: cublasFree(…)

10. Shut down CUBLAS: cublasShutdown()

CUBLAS BLAS-1 Functions: Vector-vector operations

CU(BLAS) Naming Convention

cublasIsamax: I = index of, s = single precision, amax = absolute max

Find the index of the absolute max of a vector of single-precision reals

Variants: cublasIdamax (double real), cublasIcamax (single complex), cublasIzamax (double complex)
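
To make the name concrete, here is a small CPU reference in plain C++ for what an Isamax-style routine returns. It is a sketch, not the CUBLAS implementation: `index_of_abs_max` is a hypothetical name, and the returned index is assumed to be 1-based, following the Fortran-style indexing the deck attributes to CUBLAS.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for an "Isamax"-style result: the 1-based (Fortran-style)
// index of the first element with the largest absolute value.
// Returns 0 for an empty vector.
int index_of_abs_max(const std::vector<float>& x) {
    if (x.empty()) return 0;
    std::size_t best = 0;
    for (std::size_t i = 1; i < x.size(); ++i) {
        if (std::fabs(x[i]) > std::fabs(x[best])) best = i;
    }
    return static_cast<int>(best) + 1;  // convert 0-based to 1-based
}
```

A C caller indexing its own array with the result has to subtract 1 first.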

CU(BLAS) Naming Convention

cublasSaxpy: S = single precision, axpy = alpha*x+y

Compute alpha*x+y, where x & y are single-precision real vectors & alpha is a scalar

Variant: cublasDaxpy (double precision)

CUBLAS Example-1 (CPU)

$a = x^T y$

CUBLAS Example-1 (GPU)

$a = x^T y$

• No kernel calls

• No memory mgmt.

• Increment of 1
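
The slide's code listing did not survive in this transcript. As a rough CPU reference for what a BLAS-style dot product computes, and for what the "increment" argument means, consider the plain C++ sketch below. The function name `dot` and its signature are illustrative assumptions modelled on the usual BLAS argument order (n, x, incx, y, incy); this is not the CUBLAS call itself.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for a BLAS-style dot product a = x^T y over n elements.
// incx and incy are the increments (strides) between consecutive elements;
// an increment of 1 means the vector is stored contiguously.
// Only positive increments are handled in this sketch.
float dot(int n, const std::vector<float>& x, int incx,
          const std::vector<float>& y, int incy) {
    float a = 0.0f;
    for (int i = 0; i < n; ++i) {
        a += x[static_cast<std::size_t>(i * incx)] *
             y[static_cast<std::size_t>(i * incy)];
    }
    return a;
}
```

With an increment of 2 for `x`, the routine reads every other element, which is how a single call can operate on one row or column of a larger array.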

CUBLAS Example-2 (CPU)

$z = \alpha x + y$

CUBLAS Example-2 (GPU)

$z = \alpha x + y$

Output stored in d_y
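
The note about d_y reflects the in-place nature of axpy: there is no separate output vector. A minimal CPU sketch in plain C++ (the name `axpy` and the use of `std::vector` are assumptions for illustration, not the CUBLAS interface):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for an axpy-style update: y <- alpha*x + y.
// There is no separate output vector z; the result overwrites y,
// which matches the slide's note that the output is stored in d_y.
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = alpha * x[i] + y[i];
}
```

If the original y is still needed afterwards, it has to be copied before the call.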

CUBLAS BLAS-2 Functions: Matrix-Vector Operations

$z = \alpha A x + \beta y$ (A: symmetric banded)

$x = \alpha A^{-1} y$ (A = upper or lower triangular)

CUBLAS: Caveats

• Solves Ax = b only for upper/lower triangular A

• Limited class of sparse matrices

• Column-major format & 1-indexing (Fortran style)

• C: row-major format & 0-indexing; use macros
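
The indexing caveat is usually handled with a small address-translation helper. The sketch below uses inline C++ functions in place of the macros the slide mentions; the names `idx0`, `idx1` and `to_column_major` are hypothetical and are not part of CUBLAS.

```cpp
#include <cstddef>
#include <vector>

// Offset of element (i, j) in a column-major array with leading dimension
// ld (number of rows), using 0-based indices as in C.
inline std::size_t idx0(std::size_t i, std::size_t j, std::size_t ld) {
    return j * ld + i;
}

// Same offset, but with 1-based (Fortran-style) indices.
inline std::size_t idx1(std::size_t i, std::size_t j, std::size_t ld) {
    return (j - 1) * ld + (i - 1);
}

// Copy a row-major rows x cols matrix into column-major order.
std::vector<float> to_column_major(const std::vector<float>& a,
                                   std::size_t rows, std::size_t cols) {
    std::vector<float> out(rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            out[idx0(i, j, rows)] = a[i * cols + j];
    return out;
}
```

Filling the host array through `idx0` from the start avoids the extra copy that `to_column_major` performs.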

CU(BLAS) Naming Convention

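cublasSsbmv: S = single precision, sb = symmetric banded, mv = matrix-vector product

$z = \alpha A x + \beta y$

(Figure: non-zero pattern of a banded matrix, with the x entries clustered around the diagonal)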

Example

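$z = \alpha A x + \beta y$

$$A_{(N,N)} = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix}$$

It is sufficient to store

$$\begin{bmatrix} 2 & -1 & & & \\ & 2 & -1 & & \\ & & 2 & \ddots & \\ & & & \ddots & -1 \\ & & & & 2 \end{bmatrix}_{(N,N)}$$

Stored as

$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

Symmetric-Banded

#Super-Diagonals = 1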

CUBLAS Example-3 (CPU)

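$z = \alpha A x + \beta y$

$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

Macro for 0-indexing in C

h_A in memory, column by column: X, 2, -1, 2, -1, ...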

CUBLAS Example-3 (CPU)

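$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_N \end{bmatrix} = \alpha \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} + \beta \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$$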
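
The CPU listing for this example is not part of the transcript. The following plain C++ sketch shows one way the banded storage and the product $z = \alpha A x + \beta y$ fit together, assuming the usual BLAS upper-band layout in which A(i, j) is stored at row k+i-j of column j (k = number of super-diagonals). `make_band` and `sym_band_matvec` are hypothetical names, and the unused entry X is set to 0 here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the upper-band storage of the N x N example matrix
// (2 on the diagonal, -1 on the first super-diagonal). The array has
// k+1 = 2 rows and N columns, stored column by column:
//   row 0: X -1 -1 ... -1   (X = unused entry)
//   row 1: 2  2  2 ...  2
std::vector<float> make_band(std::size_t n) {
    std::vector<float> h_A(2 * n);
    for (std::size_t j = 0; j < n; ++j) {
        h_A[2 * j + 0] = (j == 0) ? 0.0f : -1.0f;  // super-diagonal A(j-1, j)
        h_A[2 * j + 1] = 2.0f;                     // diagonal A(j, j)
    }
    return h_A;
}

// CPU reference for z = alpha*A*x + beta*y with A symmetric banded,
// k super-diagonals, upper storage: A(i, j) is band[j*(k+1) + k+i-j].
std::vector<float> sym_band_matvec(std::size_t n, std::size_t k, float alpha,
                                   const std::vector<float>& band,
                                   const std::vector<float>& x, float beta,
                                   const std::vector<float>& y) {
    assert(band.size() == (k + 1) * n && x.size() == n && y.size() == n);
    std::vector<float> z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = beta * y[i];
    for (std::size_t j = 0; j < n; ++j) {
        const std::size_t i_min = (j > k) ? j - k : 0;
        for (std::size_t i = i_min; i <= j; ++i) {
            const float a_ij = band[j * (k + 1) + (k + i - j)];
            z[i] += alpha * a_ij * x[j];              // stored upper part, A(i, j)
            if (i != j) z[j] += alpha * a_ij * x[i];  // mirrored part, A(j, i)
        }
    }
    return z;
}
```

Only 2N numbers are stored instead of N², and the lower triangle is never materialised; symmetry supplies it inside the loop.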

CUBLAS Example-3 (GPU)

$z = \alpha A x + \beta y$

$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

(Code callouts: #Rows; Upper diagonal; #Rows)

CUBLAS Optimal Usage

1. Copy from CPU to GPU: cublasSet…(…)

2. Operate on GPU

• Operation 1

• Operation 2

• …

• Operation n

3. Copy from GPU to CPU: cublasGet…(…)

CUBLAS BLAS-3 Functions: Matrix-Matrix Operations

$C = \alpha A B + \beta C$

$X = \alpha A^{-1} B$ (A = upper or lower triangular)

CUBLAS Performance

Thrust

• C++ Template Library using CUDA

• Vector containers:

– host_vector & device_vector

– Generalizes std::vector

– Store any type & dynamically resize

• Numerous algorithms

– Sort

– Sum

– Max

Thrust: Getting started

• Download into the CUDA include directory

– http://code.google.com/p/thrust/

– Requires CUDA 2.3

• Tutorial:

– http://code.google.com/p/thrust/wiki/Tutorial

Thrust: Concept

Thrust Algorithms: Prefix Sum

• Given a sequence: $\{x_1, x_2, x_3, \ldots, x_N\}$

• And an operation $\oplus$

• Output: $\{x_1,\ x_1 \oplus x_2,\ x_1 \oplus x_2 \oplus x_3,\ \ldots,\ x_1 \oplus x_2 \oplus x_3 \oplus \cdots \oplus x_N\}$

Prefix Sum

• Key to numerous algorithms

• Also referred to as the “Scan” algorithm

• Different operations produce different results

Prefix Sum: Example

• Given a sequence: $\{1, 2, 9, 6, \ldots\}$

• And an operation: $+$

• Output: $\{x_1,\ x_1 + x_2,\ x_1 + x_2 + x_3,\ \ldots,\ x_1 + x_2 + x_3 + \cdots + x_N\} = \{1, 3, 12, 18, \ldots\}$

Prefix Sum: Example

• Given a sequence: $\{1, 2, 9, 6, \ldots\}$

• And an operation: $*$

• Output: $\{x_1,\ x_1 * x_2,\ x_1 * x_2 * x_3,\ \ldots,\ x_1 * x_2 * x_3 * \cdots * x_N\} = \{1, 2, 18, 108, \ldots\}$

Prefix Sum: Example

• Given a sequence: $\{1, 2, 9, 6, \ldots\}$

• And an operation: $\max$

• Output: $\{x_1,\ \max(x_1, x_2),\ \max(\max(x_1, x_2), x_3),\ \ldots\} = \{1, 2, 9, 9, \ldots\}$
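
The three prefix-sum examples differ only in the operation, which suggests a single generic routine. Here is a sequential CPU sketch in plain C++ built on `std::partial_sum`; `prefix_scan` and `MaxOp` are hypothetical names, and this stands in for, rather than reproduces, the parallel scan that Thrust runs on the GPU.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Inclusive prefix scan: out[i] = x[0] (+) x[1] (+) ... (+) x[i],
// where (+) is any associative binary operation supplied by the caller.
template <typename Op>
std::vector<int> prefix_scan(const std::vector<int>& x, Op op) {
    std::vector<int> out(x.size());
    std::partial_sum(x.begin(), x.end(), out.begin(), op);
    return out;
}

// max wrapped as a function object, because std::max itself is an
// overloaded template and cannot be passed directly.
struct MaxOp {
    int operator()(int a, int b) const { return std::max(a, b); }
};
```

Passing `std::plus<int>()`, `std::multiplies<int>()` or `MaxOp()` selects the sum, product or running-maximum scan without touching the scan code itself.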

Thrust: Examples Set-up

Thrust: Examples

Thrust: Examples cont.

$a = \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_N^2}$
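
Assuming the formula on this slide is the Euclidean norm of x, it decomposes into a transform (square each element) followed by a reduction (sum), then a square root. A minimal CPU sketch in plain C++, with the hypothetical name `euclidean_norm`; the Thrust listing from the slide is not in the transcript, so this shows the pattern rather than the original code.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Euclidean norm a = ||x|| = sqrt(x_1^2 + ... + x_N^2), written as a
// transform (multiply each element by itself) fused with a reduction (sum).
double euclidean_norm(const std::vector<double>& x) {
    const double sum_sq = std::inner_product(x.begin(), x.end(), x.begin(), 0.0);
    return std::sqrt(sum_sq);
}
```

Fusing the squaring into the reduction avoids writing a temporary vector of squares, which matters more on a GPU where memory traffic dominates.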

CuSparse

Linear Algebra for sparse matrices using CUDA
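
The operation the deck relies on most from this library is the sparse product Ax. The sketch below is a CPU reference in plain C++ and makes one loud assumption: the matrix is held in compressed sparse row (CSR) form, a format the slides do not name. `CsrMatrix` and `spmv` are hypothetical names, not CuSparse types or calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse matrix in compressed sparse row (CSR) form. CSR is an assumption
// made for this sketch; the slides do not name a storage format.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows+1; row i spans [row_ptr[i], row_ptr[i+1])
    std::vector<std::size_t> col_idx;  // column index of each stored entry
    std::vector<double> values;        // value of each stored entry
};

// y = A * x, touching only the stored (non-zero) entries.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    assert(A.row_ptr.size() == A.rows + 1);
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
            y[i] += A.values[p] * x[A.col_idx[p]];
    return y;
}
```

The work is proportional to the number of stored entries rather than to N², and each row can be processed independently, which is what makes the operation a good fit for a GPU.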

CULA Sparse

CUFFT

CUDA implementation of the Fast Fourier Transform

Fourier Transform

• Extract frequencies from signal

• Given a function $f(t),\ -\infty < t < \infty$

• 1-D Fourier transform: $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i t \xi}\, dt$

• 2-D, 3-D

Fourier Transform

(Figure: Continuous Signal and its Fourier Transform; source: Wikipedia)

$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i t \xi}\, d\xi$

Discrete Fourier Transform

• Given a sequence $x_0, x_1, \ldots, x_{N-1}$

• Discrete Fourier transform (DFT): … another sequence

$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$
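
The definition translates directly into two nested loops. A minimal CPU sketch in plain C++ (`naive_dft` and the alias `cd` are hypothetical names), using the same sign convention as the formula, $e^{-2\pi i k n / N}$:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Direct evaluation of X_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N).
// The two nested loops over N give O(N^2) work.
std::vector<cd> naive_dft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle =
                -two_pi * static_cast<double>(k * n) / static_cast<double>(N);
            sum += x[n] * std::polar(1.0, angle);
        }
        X[k] = sum;
    }
    return X;
}
```

The output has the same length as the input; entry k measures how much of the frequency k/N is present in the sequence.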

DFT Examples

Highest frequency that can be captured correctly

Fast Fourier Transform

• DFT: Naïve $O(N^2)$ operation

• FFT: Fast DFT, $O(N \log N)$

• Key to signal processing, PDE, …

$x_0, x_1, \ldots, x_{N-1} \;\rightarrow\; \hat{x}_0, \hat{x}_1, \ldots, \hat{x}_{N-1}$

$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$
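
To see where the $O(N \log N)$ cost comes from, here is a sketch of one classic FFT variant, the recursive radix-2 Cooley-Tukey algorithm, in plain C++. The choice of this variant is an assumption for illustration: the slides do not say which algorithm is meant, and nothing here describes how CUFFT is implemented. `fft` and `cd` are hypothetical names, and the input length must be a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 Cooley-Tukey FFT. Requires a power-of-two length.
// Each level does O(N) work and there are log2(N) levels: O(N log N) total.
std::vector<cd> fft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    assert(N > 0 && (N & (N - 1)) == 0);  // power of two
    if (N == 1) return x;

    // Split into even- and odd-indexed samples and transform each half.
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t n = 0; n < N / 2; ++n) {
        even[n] = x[2 * n];
        odd[n] = x[2 * n + 1];
    }
    const std::vector<cd> E = fft(even);
    const std::vector<cd> O = fft(odd);

    // Combine: X_k = E_k + w^k O_k and X_{k+N/2} = E_k - w^k O_k,
    // with w = exp(-2*pi*i/N).
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const double angle =
            -two_pi * static_cast<double>(k) / static_cast<double>(N);
        const cd t = std::polar(1.0, angle) * O[k];
        X[k] = E[k] + t;
        X[k + N / 2] = E[k] - t;
    }
    return X;
}
```

The result matches the DFT definition term for term; only the order of the arithmetic changes, which is where the savings come from.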

CUFFT

• Fast CUDA library for FFT

• No additional downloads needed

– cufft.lib (in CUDA SDK)

– Add cufft.lib to linker

– #include <cufft.h>

CUFFT: Features

• 1-D, 2-D, 3-D

• Precisions

– Single: real & complex

– Double: real & complex (not all functions)

• Uses CUDA memory calls & fft data

• Requires a ‘plan’

• Based on FFTW model

CUFFT Example

CUFFT Example (cont.)

(Code callouts: Complex to complex; 1 data (batch))

Acknowledgements

• Graduate Students

• NSF

• UW-Madison

• Kulicke and Soffa

• Luvata

• Trek Bicycles

Publications available at www.ersl.wisc.edu

Email: suresh@engr.wisc.edu