TRANSCRIPT
Luciano Martins and Robert Sohigian, 2018-11-22
Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA
Agenda
Introduction to Python
GPU-Accelerated Computing
NVIDIA® CUDA® technology
Why Use Python with GPUs?
Methods: PyCUDA, Numba, CuPy, and scikit-cuda
Summary
Q&A
Introduction to Python
Released by Guido van Rossum in 1991
The Zen of Python:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Interpreted language (CPython, Jython, ...)
Dynamically typed; based on objects
Introduction to Python
Small core structure:
~30 keywords
~80 built-in functions
Indentation is a pretty serious thing
Binds to many different languages
Supports GPU acceleration via modules
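The points above can be illustrated with a tiny snippet (plain CPython, no extra modules; the `classify` function is just an illustrative example):

```python
# Dynamic typing: a name can be rebound to objects of different types.
x = 42           # x refers to an int
x = "forty-two"  # now x refers to a str

# Everything is an object, including functions and types themselves.
assert isinstance(x, object)
assert isinstance(int, object)

# Indentation is a pretty serious thing: blocks are delimited
# by whitespace, not braces.
def classify(n):
    if n % 2 == 0:
        return "even"
    return "odd"

print(classify(10))  # -> even
```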
GPU-Accelerated Computing
“[T]he use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications” (NVIDIA)
Most common GPU-accelerated operations:
Large vector/matrix operations (Basic Linear Algebra Subprograms - BLAS)
Speech recognition
Computer vision
GPU-Accelerated Computing
Important concepts for GPU-accelerated computing:
Host ― the machine running the workload (CPU)
Device ― the GPU(s) inside a host
Kernel ― the part of the code that runs on the GPU
SIMT ― Single Instruction, Multiple Threads
CUDA
Parallel computing platform and programming model developed by NVIDIA:
Stands for Compute Unified Device Architecture
Based on C/C++ with some extensions
Fairly short learning curve for those with OpenMP and MPI programming experience
CUDA on a system has three components:
Driver (software that controls the graphics card)
Toolkit (nvcc, several libraries, debugging tools)
SDK (examples and error-checking utilities)
CUDA
A kernel is executed as a grid of thread blocks
All threads within a block share a portion of data memory
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution to provide hazard-free common memory accesses
Efficiently sharing data through low-latency shared memory
Multiple blocks are combined to form a grid
Blocks in a grid contain the same number of threads
CUDA
[Figure: host/device execution model. The host launches Kernel 1 on the device as Grid 1, a 3×2 arrangement of thread blocks; Kernel 2's Block (1, 1) is shown expanded into a 5×3 arrangement of threads.]
CUDA
The host (CPU) performs the following tasks:
1. Initializes GPU card(s)
2. Allocates memory on host and device
3. Copies data from host to device memory
4. Launches instances of the kernel on device(s)
5. Copies data from device memory to host
6. Repeats 3-5 as needed
7. De-allocates all memory and terminates
“Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”
https://www.python.org/doc/essays/blurb/
Python (and the Need for Speed)
Because interpreted, high-level languages can be slow for high-performance workloads, Python needs assistance for those tasks.
Keep the best of both worlds:
Quick development and prototyping with Python
High processing power and speed from the GPU
Accelerating Python
Accelerated code may be pure Python or may also involve C code.
Focusing here on the following modules:
PyCUDA
Numba
CuPy
scikit-cuda
PyCUDA
A Python wrapper to the CUDA API
Gives speed to Python with near-zero wrapping overhead
Requires C programming knowledge (kernels are written in CUDA C)
Compiles the CUDA code and copies it to the GPU
CUDA errors translated to Python exceptions
Easy installation (pip)
Numba
No need to write C-code
High-performance functions written in Python
On-the-fly code generation
Native code generation for the CPU and GPU
Integration with the Python scientific stack
Takes advantage of Python decorators
Code translation done using LLVM compiler
Numba
Benchmark results (Black-Scholes example): https://numba.pydata.org/numba-examples/examples/finance/blackscholes/results.html
CuPy
An implementation of NumPy-compatible multi-dimensional array on CUDA
Useful to perform matrix ops on GPUs
Provides easy ways to define three types of CUDA kernels:
Elementwise kernels
Reduction kernels
Raw kernels
Also easy to install (pip)
CuPy
Array Size   NumPy [ms]   CuPy [ms]
10^4         0.03         0.58
10^5         0.20         0.97
10^6         2.00         1.84
10^7         55.55        12.48
10^8         517.17       84.73
scikit-cuda
Motivated by the idea of enhancing PyCUDA
Exposes GPU-powered libraries
Tested on Linux (may work elsewhere)
Can be seen as “SciPy on GPU juice”
Presents low-level and high-level functions
scikit-cuda
Low-Level Functions
Wrapping C functions via ctypes
Catching errors and mapping to Python exceptions
High-Level Functions
Take advantage of PyCUDA GPUArray to manipulate matrices in GPU memory
Some high-level functions available include FFT/IFFT, numerical integration, randomized linear algebra, and NumPy-like routines not available in PyCUDA (cumsum, zeros, etc.)
Summary
Many GPU-computing projects ported to Python are available
Keeps the simplicity of Python while adding GPU performance
Allows faster prototype development cycles
Approaches C performance, depending on the module (approach) chosen
Ranges from matrix operations and scientific programming to custom kernel creation
Summary Pages
PyCUDA Summary
PyCUDA
CUDA Python wrapper
C code added directly in the Python project
Support for all CUDA libraries
Considerable complexity, since kernels are written in C
https://documen.tician.de/pycuda/
Numba Summary
Numba
Similar coverage to PyCUDA
No C coding needed
Takes advantage of LLVM and JIT compiling
Missing: dynamic parallelism and texture memory
http://numba.pydata.org/doc.html
CuPy Summary
CuPy
Fully supports NumPy structures
Performs the same operations at scale on the GPU
Allows CPU/GPU-agnostic code creation
https://docs-cupy.chainer.org/en/stable
scikit-cuda Summary
scikit-cuda:
Scientific computing using Python and GPU
Presents high-level and low-level functions
Broad coverage of operations already available
Depends on PyCUDA GPUArray mechanisms
http://scikit-cuda.readthedocs.io/en/latest/