GPU Architecture and Programming Model
Jiří Filipović
Fall 2015
Differences among CUDA GPUs
New generations bring higher performance and new computing capabilities.
compute capability describes the richness of the GPU instruction set and the amount of resources available (registers, number of concurrently running threads, etc.)
raw performance grows with the number of cores on a GPU
Cards within the same generation differ substantially in performance
to produce more affordable cards
due to changes introduced later in the manufacturing process
to minimize power consumption of mobile GPUs
GPUs Available Today
Currently available GPUs
compute capability 1.0–5.2
we will learn the differences later
1–30 multiprocessors (19.2 GFLOPS – 6.14 TFLOPS)
frequency of 800 MHz–1.836 GHz
width and speed of data bus (64–512 bit, 6.4–336 GB/s)
Generations of CUDA GPU
Generations and their compute capability
Tesla (G80, G90, G200): c.c. 1.0, 1.1, 1.2, 1.3
do not confuse with Tesla computing cards
Fermi (GF100, GF110): c.c. 2.0, 2.1
Kepler (GK100, GK110): c.c. 3.0, 3.5, 3.7
Maxwell (GM107, GM200): c.c. 5.0, 5.2
Available products
GeForce graphics cards
mainstream solution for gaming
cheap, widely used, broad range of performance
disadvantage – limited memory, limited double-precision performance
Professional Quadro graphics cards
the same as GeForce from CUDA perspective
larger memory
several times more expensive
Tesla
a solution specially designed for CUDA computing
always large memory
expensive, interesting for computing centers and personal supercomputers
GPU Parallelism
Parallel algorithms need to be designed w.r.t. the parallelism available in the HW
GPU: an array of SIMT multiprocessors working with shared memory
Decomposition for GPU
coarse-grained decomposition of the problem into parts that don't need intensive communication
fine-grained decomposition similar to vectorization (but SIMT is more flexible)
Task Hierarchy
SIMT
A multiprocessor of the G80 has one unit issuing instructions
all 8 SPs have to execute the same instruction
a new instruction is issued every 4 cycles
32 threads (a so-called warp) need to execute the same instruction; the warp size is fixed for all existing CUDA hardware
What about code branching?
if different parts of a warp perform different instructions, they are serialized
this decreases performance and should be avoided
The multiprocessor is thus (nearly) MIMD (Multiple-Instruction Multiple-Thread) from the programmer's perspective and SIMT (Single-Instruction Multiple-Thread) from the performance perspective.
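As an illustration, a hedged sketch of a divergent branch (branchy is a made-up kernel): even and odd lanes of each warp take different paths, so the two branches execute one after the other.

__global__ void branchy(const float *a, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = a[i] * 2.0f;  // even lanes take this path
    else
        out[i] = a[i] + 1.0f;  // odd lanes wait, then execute this path
}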
GPU Architecture
SIMT reconvergence
At the end of divergent code, a point of reconvergence is set by the compiler
it creates a barrier for the threads within a warp
it guarantees thread synchronization after divergent code
we have to keep the reconvergence points in mind – they can create deadlocks that cannot arise in true MIMD
SIMT reconvergence
Suppose we try to serialize a region of code with the following construct:

__shared__ int s = 0;  // note: real CUDA forbids initializers on __shared__ variables; shown for illustration
while (s != threadIdx.x);
// serialized region
s++;

Because of the reconvergence point, this deadlocks (the reconvergence point is placed before the increment of s). Fix:

__shared__ int s = 0;
while (s < blockDim.x)
    if (threadIdx.x == s) {
        // serialized region
        s++;
    }
Thread Properties
GPU threads are very lightweight compared to CPU threads.
their run time can be very short (even tens of instructions)
there should be many of them
they should not use large amounts of resources
Threads are aggregated into blocks
all threads of a block always run on the same multiprocessor (multiple blocks can run on one multiprocessor)
having a sufficient number of blocks is essential to achieve good scalability
The number of threads and thread blocks per multiprocessor is limited.
Memory Latency Masking
Memory has latency
global memory has high latency (hundreds of cycles)
registers and shared memory have read-after-write latency
Memory latency hiding works differently than on a CPU
no instructions are executed out of order (but ILP can be exploited by delaying the completion of a load instruction until just before the loaded data are needed)
no or limited cache
When a warp waits for data from memory, another warp may be executed
allows memory latency hiding
requires execution of more threads than the number of GPU cores
thread execution scheduling and switching is implemented directly in HW without overhead
Thread-Local Memory
Registers
the fastest memory, directly usable in instructions
local variables in a kernel and variables holding intermediate results are placed automatically into registers
if there is a sufficient number of registers
if the compiler can determine static array indexing (illustrated below)
thread scoped
Local memory
data that do not fit into registers are spilled into local memory
local memory is stored in DRAM =⇒ slow, high latency
thread scoped
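A small sketch of the static-indexing condition (idx is a hypothetical kernel, not from the slides):

__global__ void idx(float *out, int i)
{
    float a[4] = {0.f, 1.f, 2.f, 3.f};
    out[0] = a[2];  // static index: the compiler can keep a in registers
    out[1] = a[i];  // dynamic index: a is typically spilled to local memory
}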
Shared Memory
Shared memory
as fast as registers for c.c. 1.x, a little bit slower for newer GPUs
if memory bank conflicts are avoided
instructions can use only one operand in shared memory (otherwise an explicit load/store is needed)
declared using __shared__ in C for CUDA
a variable in shared memory can have a dynamic size (determined at startup) if declared as extern without a size specification
block scoped
Shared Memory
Static shared memory declaration
__shared__ float myArray[128];
Dynamic allocation
extern __shared__ char myArray[];
float *array1 = (float *)myArray;
int   *array2 = (int *)&array1[128];
short *array3 = (short *)&array2[256];

This creates an array array1 of float type with 128 elements, array2 of int type with 256 elements, and array3 of short type with a floating size (it occupies the remainder). The total size has to be specified at kernel startup.

myKernel<<<grid, block, n>>>();
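For the layout above, the total size n might be computed as follows (numShorts, the element count of array3, is an assumption):

size_t n = 128 * sizeof(float) + 256 * sizeof(int) + numShorts * sizeof(short);
myKernel<<<grid, block, n>>>();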
Global Memory
Global memory
an order of magnitude lower bandwidth compared to shared memory
latency on the order of hundreds of GPU cycles
addressing needs to be aligned to get optimum performance
application-scoped
cached on some architectures, e.g. the L1 cache (128-byte lines) and L2 cache (32-byte lines) of the Fermi architecture
May be dynamically allocated using cudaMalloc or statically allocated using a __device__ declaration.
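Both variants in a short sketch (the buffer names are illustrative):

__device__ float staticBuf[256];  // statically allocated global memory

float *dynBuf;                    // dynamically allocated global memory
cudaMalloc((void **)&dynBuf, 256 * sizeof(float));
// ... use in kernels ...
cudaFree(dynBuf);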
Constant Memory
Constant memory
read-only
cached
a cache hit is as fast as a register (under certain constraints), a cache miss is as fast as global memory
limited size (64 kB for currently available GPUs)
application-scoped
Constant Memory
Declared using the __constant__ keyword; the following function is used for copying data to constant memory:

cudaError_t cudaMemcpyToSymbol(const char *symbol, const void *src,
                               size_t count, size_t offset,
                               enum cudaMemcpyKind kind);

Data are copied from system memory (cudaMemcpyHostToDevice) or from global memory (cudaMemcpyDeviceToDevice), from src into symbol. The copied block has count bytes and is copied with offset into the symbol memory.
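A usage sketch (coeffs and hostCoeffs are made-up names):

__constant__ float coeffs[16];  // resides in constant memory

float hostCoeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float), 0,
                   cudaMemcpyHostToDevice);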
Texture Memory
Texture memory
cached, 2D locality
read-only for cache coherency reasons
high latency
several addressing modes
normalization into the [0, 1] range
clamping or wrapping of out-of-range coordinates
possible data filtering
linear interpolation or nearest value
this functionality is “for free” (implemented in HW)
More details are available in the CUDA Programming Guide.
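For illustration, a minimal device-side sketch using the legacy texture reference API of this era (host-side creation and binding of tex are omitted):

texture<float, 2, cudaReadModeElementType> tex;  // file-scope texture reference

__global__ void sample(float *out, float u, float v)
{
    out[0] = tex2D(tex, u, v);  // cached 2D read, optionally filtered in HW
}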
Data Cache
Read-only data cache
c.c. 3.5 or higher
the same hardware as the texture cache, but with straightforward usage
the compiler automatically uses the data cache when it recognizes that data are read-only
we can help it with const and __restrict__
usage can be forced by __ldg() (see the sketch below)
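A small sketch of both hints (scale is a made-up kernel):

__global__ void scale(const float *__restrict__ in,
                      float *__restrict__ out, float f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = f * __ldg(&in[i]);  // __ldg forces the read-only data cache
}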
System-Local Memory
System RAM
connected to GPU via PCIe
CPU (host) and GPU (device) memory transfers are complicated by virtual addressing
it is possible to allocate so-called page-locked memory areas
overall system performance may be reduced
limited size
data are transferred faster over PCIe
allows kernel execution to overlap with data copying
allows mapping of the host address space onto the device
allows write-combining access (data are not cached by the CPU)
Page Locked Memory
cudaMallocHost() is used instead of malloc() to allocate the memory; the memory is freed using cudaFreeHost(). The flags below are passed to cudaHostAlloc():
the cudaHostAllocPortable flag ensures page-locked memory for all CPU threads
the cudaHostAllocWriteCombined flag turns off CPU caching for the allocated memory
the cudaHostAllocMapped flag maps the host memory into the device address space
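A brief sketch of a mapped, page-locked allocation (hostBuf and n are illustrative):

float *hostBuf;
cudaHostAlloc((void **)&hostBuf, n * sizeof(float), cudaHostAllocMapped);
// ... copy to the device, or access it through the device mapping ...
cudaFreeHost(hostBuf);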
Synchronization within the Block
Within block
native barrier synchronization
all threads have to enter it (beware of conditionals!)
only one instruction, very fast if it doesn't degrade parallelism
called in C for CUDA as __syncthreads()
Fermi extensions: __syncthreads_count(), __syncthreads_and(), __syncthreads_or()
shared memory communication
threads can exchange data
the barrier ensures that the data are ready
synchronization latency is hidden in the same way as memory latency
multiple blocks on a multiprocessor
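A minimal sketch of shared-memory communication guarded by the barrier (reverseBlock is a made-up kernel and assumes blockDim.x == 256):

__global__ void reverseBlock(float *d)
{
    __shared__ float tmp[256];
    int base = blockIdx.x * blockDim.x;
    tmp[threadIdx.x] = d[base + threadIdx.x];
    __syncthreads();  // all writes to tmp are now visible to the whole block
    d[base + threadIdx.x] = tmp[blockDim.x - 1 - threadIdx.x];
}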
Block Synchronization
Among blocks
global memory is visible to all blocks
poor support for synchronization
no global barrier
atomic operations on global memory on newer GPUs
a global barrier can be implemented using kernel calls (sketched below); other solutions are quite tricky
poor options for global synchronization make programming harder but allow for very good scalability
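The kernel-call barrier in a sketch: two kernels launched into the same stream run one after the other, so the second launch acts as a global barrier (step1 and step2 are hypothetical kernels).

step1<<<grid, block>>>(d_data);
step2<<<grid, block>>>(d_data);  // starts only after all blocks of step1 have finished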
Matrix Multiplication
We want to multiply matrices A and B and store the result into C. For the sake of simplicity, we only assume matrices sized n × n.
$C_{i,j} = \sum_{k=1}^{n} A_{i,k} \cdot B_{k,j}$
C language:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        C[i*n + j] = 0.0;
        for (int k = 0; k < n; k++)
            C[i*n + j] += A[i*n + k] * B[k*n + j];
    }
Parallelization
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        C[i*n + j] = 0.0;
        for (int k = 0; k < n; k++)
            C[i*n + j] += A[i*n + k] * B[k*n + j];
    }
Multiple ways of parallelization
choose one loop
choose two loops
parallelize all the loops
Parallelization
Parallelization of one loop
doesn't scale well, it is necessary to use big matrices (we need tens of thousands of threads for good GPU utilization)
Parallelization of two loops
scales well, the number of threads grows quadratically with n
Parallelization using the inner loop
bad, synchronization is needed when writing into C!
The best way is thus to parallelize the loops over i and j.
First Kernel
We can arrange both the block and the grid as 2D arrays.

__global__ void mmul(float *A, float *B, float *C, int n)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    float tmp = 0;
    for (int k = 0; k < n; k++)
        tmp += A[y*n + k] * B[k*n + x];

    C[y*n + x] = tmp;
}

Note the similarity to the mathematical description – the parallel version is more intuitive than the serial one!
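For completeness, a hedged sketch of the host-side launch (a 16 × 16 block is an assumption, n is assumed to be a multiple of 16, and d_A, d_B, d_C are device pointers):

dim3 block(16, 16);
dim3 grid(n / 16, n / 16);
mmul<<<grid, block>>>(d_A, d_B, d_C, n);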
Performance
What will be the performance of our implementation?
Let’s look at GeForce GTX 280
available 622 GFLOPS for matrix multiplication
memory bandwidth is 142 GB/s
Flop-to-word ratio of our implementation
in one step over k, we read 2 floats (one element of A and one of B) and perform two arithmetic operations
one arithmetic operation thus corresponds to the transfer of one float
global memory offers a throughput of 35.5 billion floats per second; if one warp transfers one float from one matrix and 16 floats from the other matrix, we can achieve 66.8 GFLOPS
66.8 GFLOPS is very far from 622 GFLOPS
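Checking the arithmetic (counted per half-warp, the memory-transaction unit of the GTX 280): 142 GB/s ÷ 4 B per float ≈ 35.5 Gfloat/s; a half-warp of 16 threads performs 2 · 16 = 32 flops per 1 + 16 = 17 floats transferred, hence 35.5 · 32/17 ≈ 66.8 GFLOPS.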
How to Improve It?
We hit the limit of global memory. GPUs have faster types of memory – can we use them?
For the computation of one element of C, we have to read one row of A and one column of B, which reside in global memory.
Is it really necessary to do that separately for each element of C?
we read the same row of A for all elements in the same row of C
we read the same column of B for all elements in the same column of C
we can read some data from global memory into shared memory only once and then read them repeatedly from shared memory
Tiled Algorithm
If we access the matrices in tiles, we can amortize transfers from global memory:
we will compute an a × a tile of matrix C
we iteratively read tiles of the same size of matrices A and B into shared memory
the tiles are multiplied and added into C
the ratio of arithmetic operations to data transfers is a times better
Natural mapping on GPU parallelism
each tile of C will be computed by one thread block
shared-memory locality is ensured
no inter-block synchronization is needed
Tiled Algorithm
How big should the tiles be?
limited by the size of shared memory
limited by the number of threads that can run on the GPU
if one thread is to compute one element of C, a reasonable block size is 16 × 16
a multiple of the warp size
one block has a reasonable 256 threads
one block needs 2 KB of shared memory
the memory will not substantially limit the performance (16 · 35.5 = 568 GFLOPS, which is quite close to 622 GFLOPS)
Algorithm
Algorithm schema
each thread block has As and Bs in shared memory
tiles of matrices A and B are multiplied iteratively, the results are accumulated in the Csub variable
threads in a block read the tiles into As and Bs cooperatively
each thread multiplies rows of As and columns of Bs for its element of the Csub matrix
each thread stores one element of the result into matrix C in global memory
Beware of synchronization
the tiles need to be read completely before the multiplication starts
before we read new tiles, the operation on the previous data needs to be completed
Second Kernel
__global__ void mmul(float *A, float *B, float *C, int n)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    float Csub = 0.0f;
    for (int b = 0; b < n/BLOCK_SIZE; b++) {
        As[ty][tx] = A[(ty + by*BLOCK_SIZE)*n + b*BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(ty + b*BLOCK_SIZE)*n + bx*BLOCK_SIZE + tx];
        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];

        __syncthreads();
    }
    C[(ty + by*BLOCK_SIZE)*n + bx*BLOCK_SIZE + tx] = Csub;
}
Performance
the theoretical limit of the first kernel is 66.8 GFLOPS, the measured performance is 36.6 GFLOPS
the theoretical limit of the second kernel is 568 GFLOPS, the measured performance is 198 GFLOPS
how do we get closer to the maximum performance of the card?
we need to understand the HW and its limitations better and optimize the algorithms accordingly
topics for the next lectures