TRANSCRIPT
Advanced CUDA Optimization
1. Introduction
Thomas Bradley
© NVIDIA Corporation 2010
Agenda
CUDA Review
  Review of CUDA Architecture
  Programming & Memory Models
  Programming Environment
  Execution
Performance
  Optimization Guidelines
Productivity
  Resources
REVIEW OF CUDA ARCHITECTURE
CUDA Review
Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
[Diagram: CPU and GPU memories connected via the PCI Bus]
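A minimal host-side sketch of this flow; myKernel, run and the launch configuration are hypothetical names, not from the slides:

  __global__ void myKernel(const float *in, float *out, int n)  // stand-in kernel
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i] * 2.0f;
  }

  void run(const float *h_in, float *h_out, int n)
  {
      size_t bytes = n * sizeof(float);
      float *d_in, *d_out;
      cudaMalloc((void**)&d_in, bytes);
      cudaMalloc((void**)&d_out, bytes);
      // 1. copy input data across the PCI bus to GPU memory
      cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
      // 2. load the GPU program and execute (data is cached on chip while it runs)
      myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
      // 3. copy results from GPU memory back to CPU memory
      cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
      cudaFree(d_in);
      cudaFree(d_out);
  }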
CUDA Parallel Computing Architecture
Parallel computing architecture and programming model.
Includes a CUDA C compiler, with support for OpenCL and DirectCompute.
Architected to natively support multiple computational interfaces (standard languages and APIs).
CUDA Parallel Computing Architecture
CUDA defines:
Programming model
Memory model
Execution model
CUDA uses the GPU, but for general-purpose computing.
It facilitates heterogeneous computing: CPU + GPU.
CUDA is scalable: programs scale to run on hundreds of cores and thousands of parallel threads.
PROGRAMMING MODEL
CUDA Review
CUDA Kernels
Parallel portions of an application execute as kernels
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU (host): executes functions
GPU (device): executes kernels
CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads, in parallel.
All threads execute the same code, but can take different paths.
Each thread has an ID, used to:
  select input/output data
  make control decisions

  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
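A minimal complete version of such a kernel; func is assumed to be a device function (sqrtf here as a stand-in), and applyFunc is a hypothetical name:

  __device__ float func(float x) { return sqrtf(x); }  // placeholder for the slide's func

  __global__ void applyFunc(const float *input, float *output, int n)
  {
      int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique ID per thread
      if (threadID < n)                                      // guard the final partial block
          output[threadID] = func(input[threadID]);
  }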
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks.
Blocks are grouped into a grid.
A kernel is executed as a grid of blocks of threads.
[Diagram: grid of thread blocks mapped onto the GPU]
Communication Within a Block
Threads may need to cooperate
Memory accesses
Share results
Cooperate using shared memory
Accessible by all threads within a block
Restricting communication to “within a block” permits scalability:
fast communication between N threads is not feasible when N is large.
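A minimal sketch of block-level cooperation via shared memory, assuming blocks of 256 threads (reversing each block's segment of the data, a made-up example):

  __global__ void reverseWithinBlock(float *data)
  {
      __shared__ float tile[256];                    // visible to all threads in the block
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = data[i];
      __syncthreads();                               // wait until every thread has written
      data[i] = tile[blockDim.x - 1 - threadIdx.x];  // read a value written by another thread
  }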
Transparent Scalability
The same grid of 12 thread blocks runs unchanged on GPUs of different sizes:
G84: blocks execute two at a time (1-2, 3-4, ..., 11-12)
G80: blocks execute eight at a time (1-8, then 9-12)
GT200: all 12 blocks execute at once, with multiprocessors left idle
[Figure: the same 12 blocks scheduled across G84, G80 and GT200]
CUDA Programming Model - Summary
A kernel executes as a grid of thread blocks.
A block is a batch of threads; threads in a block communicate through shared memory.
Each block has a block ID; each thread has a thread ID.
Grids and blocks can be 1D or 2D (block IDs such as 0..3, or (0,0)..(1,3)).
[Figure: the host launches Kernel 1 as a 1D grid and Kernel 2 as a 2D grid on the device]
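A sketch of how the block ID and thread ID combine into a global index for a 1D launch (scale is a hypothetical kernel):

  __global__ void scale(float *data, float a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // block ID and thread ID form a global index
      if (i < n)
          data[i] *= a;
  }

  // host: one thread per element, 256 threads per block
  // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);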
MEMORY MODEL
CUDA Review
Memory hierarchy
Thread: registers
Thread: local memory
Block of threads: shared memory
All blocks: global memory
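A sketch of where variables live in this hierarchy, assuming 128-thread blocks (hierarchyDemo is a hypothetical kernel; whether scratch actually spills to local memory is up to the compiler):

  __global__ void hierarchyDemo(float *result)        // result: global memory, visible to all blocks
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register, private to the thread
      float scratch[64];                              // large per-thread arrays may be placed in local memory
      __shared__ float s[128];                        // shared memory, one copy per block

      scratch[i % 64] = (float)i;
      s[threadIdx.x] = scratch[i % 64];
      __syncthreads();
      result[i] = s[threadIdx.x];
  }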
Additional Memories
Host can also allocate textures and arrays of constants
Textures and constants have dedicated caches
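A minimal sketch of constant memory usage (coeffs and applyCoeffs are assumed names):

  __constant__ float coeffs[16];                    // read-only on the device, served by the constant cache

  __global__ void applyCoeffs(float *data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          data[i] *= coeffs[i % 16];
  }

  // host: cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));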
PROGRAMMING ENVIRONMENT
CUDA Review
CUDA C and OpenCL
Shared back-end compiler and optimization technology.
OpenCL: entry point for developers who want a low-level API.
CUDA C: entry point for developers who prefer high-level C.
Visual Studio
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Compilation rules: cuda.rules
Syntax highlighting
Intellisense
Integrated debugger and profiler: Nexus
NVIDIA Nexus IDE
The industry’s first IDE for massively parallel applications.
Accelerates co-processing (CPU + GPU) application development.
Complete Visual Studio-integrated development environment.
Linux
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Typically makefile-driven
cuda-gdb for debugging
CUDA Visual Profiler
OPTIMIZATION GUIDELINES
Performance
Optimize Algorithms for GPU
Algorithm selection: understand the problem and consider alternative algorithms.
Maximize independent parallelism.
Maximize arithmetic intensity (math per unit of memory bandwidth).
Recompute? The GPU allocates transistors to arithmetic, not memory, so it is sometimes better to recompute than to cache.
Serial computation on GPU? A low-parallelism computation may still be faster on the GPU than copying to and from the host.
Optimize Memory Access
Coalesce global memory accesses (see the sketch below)
  Maximize DRAM efficiency
  Order-of-magnitude impact on performance
Avoid serialization
  Minimize shared memory bank conflicts
  Understand constant cache semantics
Understand spatial locality
  Optimize use of textures to ensure spatial locality
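A sketch contrasting a coalesced access pattern with a strided one (both kernels are hypothetical):

  // Coalesced: consecutive threads access consecutive addresses (few wide DRAM transactions)
  __global__ void copyCoalesced(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i];
  }

  // Strided: consecutive threads access addresses far apart (many narrow transactions)
  __global__ void copyStrided(const float *in, float *out, int n, int stride)
  {
      int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
      if (i < n) out[i] = in[i];
  }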
Exploit Shared Memory
Hundreds of times faster than global memory
Inter-thread cooperation via shared memory and synchronization
Cache data that is reused by multiple threads
Stage loads/stores to allow reordering (see the transpose sketch below)
Avoid non-coalesced global memory accesses
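A sketch of staging through shared memory so that both the global read and the global write are coalesced: a 16x16 tile transpose, assuming a square matrix whose width is a multiple of 16 (transpose16 is a hypothetical name):

  __global__ void transpose16(const float *in, float *out, int width)
  {
      __shared__ float tile[16][17];  // 17 columns pad away shared memory bank conflicts
      int x = blockIdx.x * 16 + threadIdx.x;
      int y = blockIdx.y * 16 + threadIdx.y;
      tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global read
      __syncthreads();                                      // tile fully populated
      x = blockIdx.y * 16 + threadIdx.x;                    // swap block coordinates
      y = blockIdx.x * 16 + threadIdx.y;
      out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced global write
  }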
Use Resources Efficiently
Partition the computation to keep the multiprocessors busy
  Many threads, many thread blocks
  Multiple GPUs
Monitor per-multiprocessor resource utilization
  Registers and shared memory
  Low utilization per thread block permits multiple active blocks per multiprocessor
Overlap computation with I/O
  Use asynchronous memory transfers (sketched below)
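A minimal sketch of overlapping transfers with other work using a stream; d_in, h_in, process and the launch configuration are assumed, and h_in must be allocated with cudaHostAlloc (pinned) for the copies to be truly asynchronous:

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
  process<<<blocks, threads, 0, stream>>>(d_in, n);  // queued after the copy, in the same stream
  cudaMemcpyAsync(h_in, d_in, bytes, cudaMemcpyDeviceToHost, stream);
  // the CPU is free to do other work here while the GPU proceeds
  cudaStreamSynchronize(stream);                     // block until the stream's work is done
  cudaStreamDestroy(stream);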
RESOURCES
Productivity
Getting Started
CUDA Zone: www.nvidia.com/cuda
  Introductory tutorials/webinars
  Forums
Documentation
  Programming Guide
  Best Practices Guide
Examples
  CUDA SDK
Libraries
NVIDIA:
  cuBLAS - dense linear algebra (subset of the full BLAS suite)
  cuFFT - 1D/2D/3D transforms, real and complex
Third party:
  NAG - numeric libraries, e.g. RNGs
  cuLAPACK/MAGMA
Open source:
  Thrust - STL/Boost-style template library
  cuDPP - data-parallel primitives (e.g. scan, sort and reduction)
  CUSP - sparse linear algebra and graph computation
Many more...
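As a taste of the Thrust style, a minimal sort on the GPU (this mirrors the canonical Thrust example):

  #include <thrust/host_vector.h>
  #include <thrust/device_vector.h>
  #include <thrust/generate.h>
  #include <thrust/sort.h>
  #include <thrust/copy.h>
  #include <cstdlib>

  int main()
  {
      thrust::host_vector<int> h(1 << 20);
      thrust::generate(h.begin(), h.end(), rand);   // random data on the host
      thrust::device_vector<int> d = h;             // copy to the GPU
      thrust::sort(d.begin(), d.end());             // parallel sort on the device
      thrust::copy(d.begin(), d.end(), h.begin());  // copy results back
      return 0;
  }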