CUDA C/C++ BASICS
© NVIDIA 2013
What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory etc.
• This session introduces CUDA C/C++
Introduction to CUDA C/C++
• What will you learn in this session?
– Start from "Hello World!"
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization
Prerequisites
• You (probably) need experience with C or C++
• You don't need GPU experience
• You don't need parallel programming experience
• You don't need graphics experience
CONCEPTS
• Heterogeneous Computing
• Blocks
• Threads
• Indexing
• Shared memory
• __syncthreads()
• Asynchronous operation
• Handling errors
• Managing devices
HELLO WORLD!
Heterogeneous Computing
• Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)
Heterogeneous Computing

#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;            // host copies
    int *d_in, *d_out;        // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

(The host code in main() is serial; the parallel function stencil_1d() runs on the device.)
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
(Data moves between CPU and GPU over the PCI bus.)
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

• Two new syntactic elements…
Hello World! with Device Code

__global__ void mykernel(void) {
}

• The CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
Hello World! with Device Code

mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a "kernel launch"
– We'll return to the parameters (1,1) in a moment
• That's all that is required to execute a function on the GPU!
Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

• mykernel() does nothing, somewhat anticlimactic!

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about massive parallelism!
• We need a more interesting example…
• We'll start by adding two integers (a + b = c) and build up to vector addition
Addition on the Device
• A simple kernel to add two integers:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• As before, __global__ is a CUDA C/C++ keyword meaning:
– add() will execute on the device
– add() will be called from the host
Addition on the Device
• Note that we use pointers for the variables:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory
• We need to allocate memory on the GPU
Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory
  • May be passed to/from host code
  • May not be dereferenced in host code
– Host pointers point to CPU memory
  • May be passed to/from device code
  • May not be dereferenced in device code
• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
Addition on the Device: add()
• Returning to our add() kernel:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• Let's take a look at main()…
Addition on the Device: main()

int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
RUNNING IN PARALLEL
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

add<<< 1, 1 >>>();
add<<< N, 1 >>>();

• Instead of executing add() once, execute it N times in parallel
Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index
Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let's take a look at main()…
Vector Addition on the Device: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Review (1 of 2)
• Difference between host and device
– Host: CPU
– Device: GPU
• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
• Passing parameters from host code to a device function
Review (2 of 2)
• Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()
• Launching parallel kernels
– Launch N copies of add() with add<<<N,1>>>(…);
– Use blockIdx.x to access the block index
INTRODUCING THREADS
CUDA Threads
• Terminology: a block can be split into parallel threads
• Let's change add() to use parallel threads instead of parallel blocks
• We use threadIdx.x instead of blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• Need to make one change in main()…
Vector Addition Using Threads: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
COMBINING THREADS AND BLOCKS
Combining Blocks and Threads
• We've seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads
• Let's adapt vector addition to use both blocks and threads
• Why? We'll come to that…
• First let's discuss data indexing…
Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
– Consider indexing an array with one element per thread (8 threads/block): threadIdx.x runs 0..7 within each block, while blockIdx.x = 0, 1, 2, 3 across the blocks
• With M threads/block, a unique index for each thread is given by:

int index = threadIdx.x + blockIdx.x * M;
Indexing Arrays: Example
• Which thread will operate on the red element (element 21 of a 32-element array)?

threadIdx.x = 5, blockIdx.x = 2, M = 8

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
Vector Addition with Blocks and Threads
• What changes need to be made in main()?
• Use the built-in variable blockDim.x for threads per block:

int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
Addition with Blocks and Threads: main()

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch:

add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
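To see how the rounded-up launch and the bounds check fit together, here is a minimal self-contained sketch; the concrete values of N and M are illustrative (chosen so N is deliberately not a multiple of the block size), not from the deck:

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N 1000          // deliberately not a multiple of M
#define M 256           // threads per block

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                      // guard: the last block has extra threads
        c[index] = a[index] + b[index];
}

int main(void) {
    int size = N * sizeof(int);
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // (N + M - 1) / M rounds up: 1000 elements -> 4 blocks of 256 = 1024 threads
    add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[999] = %d\n", c[999]);    // expect 999 + 2*999 = 2997

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Without the `if (index < n)` guard, the 24 surplus threads in the last block would read and write past the ends of the arrays.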
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize
• To look closer, we need a new example…
Review
• Launching parallel kernels
– Launch N copies of add() with add<<<N/M,M>>>(…);
– Use blockIdx.x to access the block index
– Use threadIdx.x to access the thread index within a block
• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;
COOPERATING THREADS
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of input elements within a radius
• If the radius is 3, then each output element is the sum of 7 input elements
(Figure: an output element summing a window that extends radius elements to each side.)
Implementing Within a Block
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
Implementing With Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
• Each block needs a halo of radius elements at each boundary
(Figure: blockDim.x output elements, with a halo on the left and a halo on the right.)
Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Data Race!
• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                    // thread 15: store at temp[18]
if (threadIdx.x < RADIUS) {                   // thread 15: skipped, threadIdx.x > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                   // thread 15: load from temp[19],
                                              // which thread 0 may not have written yet
__syncthreads()

void __syncthreads();

• Synchronizes all threads within a block
– Used to prevent RAW/WAR/WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block
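To illustrate the uniformity requirement, here is a sketch (not from the original deck; kernel names and sizes are illustrative) of a barrier placed incorrectly inside a divergent branch, next to the corrected placement:

```cuda
__global__ void broken(int *data) {
    __shared__ int buf[256];
    buf[threadIdx.x] = data[threadIdx.x];

    if (threadIdx.x < 128) {
        __syncthreads();   // WRONG: threads 128..255 never reach this barrier,
    }                      // so the block can hang or behave unpredictably
}

__global__ void fixed(int *data) {
    __shared__ int buf[256];
    buf[threadIdx.x] = data[threadIdx.x];

    __syncthreads();       // OK: every thread in the block reaches the barrier

    if (threadIdx.x < 128)
        data[threadIdx.x] = buf[threadIdx.x + 128];  // safe: all writes are visible
}
```

The branch itself is fine; only the barrier must sit where every thread of the block executes it.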
Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Review (1 of 2)
• Launching parallel threads
– Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
– Use blockIdx.x to access the block index within the grid
– Use threadIdx.x to access the thread index within the block
• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;
Review (2 of 2)
• Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards
MANAGING THE DEVICE
Coordinating Host & Device
• Kernel launches are asynchronous
– Control returns to the CPU immediately
• CPU needs to synchronize before consuming the results
– cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins when all preceding CUDA calls have completed
– cudaMemcpyAsync(): asynchronous, does not block the CPU
– cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
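The three calls above can be combined as in this sketch (the kernel and sizes are illustrative; cudaMallocHost is used because cudaMemcpyAsync needs page-locked host memory to actually overlap with CPU work):

```cuda
__global__ void mykernel(int *data) {
    data[threadIdx.x + blockIdx.x * blockDim.x] += 1;
}

int main(void) {
    int *h_data, *d_data;
    int size = 1024 * sizeof(int);

    cudaMallocHost((void **)&h_data, size);   // pinned (page-locked) host memory
    cudaMalloc((void **)&d_data, size);
    for (int i = 0; i < 1024; i++) h_data[i] = i;

    // Both calls below return control to the CPU immediately
    cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, 0);
    mykernel<<<4, 256>>>(d_data);

    // ... the CPU is free to do other work here ...

    cudaDeviceSynchronize();   // wait until the copy and the kernel have finished
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // blocking copy

    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```

The final cudaMemcpy() alone would also have sufficed as a synchronization point, since it waits for all preceding CUDA calls before copying.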
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself, OR
– Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:

cudaError_t cudaGetLastError(void)

• Get a string to describe the error:

const char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));
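A common way to apply this in practice is a checking macro around every runtime call; this is a widespread convention rather than part of the CUDA API, and the kernel here is a placeholder:

```cuda
#include <stdio.h>
#include <stdlib.h>

__global__ void mykernel(int *data) { data[threadIdx.x] = threadIdx.x; }

// Convention (not a CUDA API): report file and line on any runtime error
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void) {
    int *d_a;
    CUDA_CHECK(cudaMalloc((void **)&d_a, 512 * sizeof(int)));

    mykernel<<<1, 512>>>(d_a);           // a launch itself returns no error code
    CUDA_CHECK(cudaGetLastError());      // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize()); // catches errors raised during execution

    CUDA_CHECK(cudaFree(d_a));
    return 0;
}
```

Because kernels execute asynchronously, the cudaGetLastError()/cudaDeviceSynchronize() pair after the launch is what surfaces kernel-side failures.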
Device Management
• Application can query and select GPUs
– cudaGetDeviceCount(int *count)
– cudaSetDevice(int device)
– cudaGetDevice(int *device)
– cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple host threads can share a device
• A single host thread can manage multiple devices
– cudaSetDevice(i) to select the current device
– cudaMemcpy(…) for peer-to-peer copies ✝

✝ requires OS and device support
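A short sketch of enumerating devices with the calls listed above (name, major, and minor are real fields of cudaDeviceProp; the selection of device 0 is just an example):

```cuda
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Found %d CUDA device(s)\n", count);

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }

    if (count > 0)
        cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls
    return 0;
}
```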
Introduction to CUDA C/C++
• What have we learned?
– Write and launch CUDA C/C++ kernels
  • __global__, blockIdx.x, threadIdx.x, <<<>>>
– Manage GPU memory
  • cudaMalloc(), cudaMemcpy(), cudaFree()
– Manage communication and synchronization
  • __shared__, __syncthreads()
  • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
Compute Capability
• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities
• The following presentations concentrate on Fermi devices
– Compute Capability >= 2.0

Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series
IDs and Dimensions
• A kernel is launched as a grid of blocks of threads
– blockIdx and threadIdx are 3D
– We showed only one dimension (x)
• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim
(Figure: a device grid of blocks indexed (x, y, 0), with each block, e.g. block (1,1,0), containing a 2D arrangement of threads indexed (x, y, 0).)
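The 3D built-ins above generalize the 1D indexing used throughout this deck. A sketch of a 2D launch using dim3 (the kernel, sizes, and scale factor are illustrative, not from the deck):

```cuda
#define WIDTH  64
#define HEIGHT 64

__global__ void scale2d(float *data, float factor) {
    // 2D global coordinates from the 3D built-ins (z unused here)
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < WIDTH && y < HEIGHT)
        data[y * WIDTH + x] *= factor;   // row-major flattening of the 2D index
}

int main(void) {
    float *d_data;
    cudaMalloc((void **)&d_data, WIDTH * HEIGHT * sizeof(float));

    dim3 threads(16, 16);                          // 256 threads per block
    dim3 blocks((WIDTH  + threads.x - 1) / threads.x,
                (HEIGHT + threads.y - 1) / threads.y);
    scale2d<<<blocks, threads>>>(d_data, 2.0f);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The rounding-up of the block counts mirrors the 1D (N + M-1)/M pattern, one dimension at a time.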
Textures
• Read-only object
– Dedicated cache
• Dedicated filtering hardware (linear, bilinear, trilinear)
• Addressable as 1D, 2D or 3D
• Out-of-bounds address handling (wrap, clamp)
(Figure: a 2D texture grid with sample points at coordinates (2.5, 0.5) and (1.0, 1.0).)
Topics We Skipped
• We skipped some details; you can learn more:
– CUDA Programming Guide
– CUDA Zone: tools, training, webinars and more (developer.nvidia.com/cuda)
• Need a quick primer for later:
– Multi-dimensional indexing
– Textures