![Page 1: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/1.jpg)
NVIDIA CUDA
Seminar on Multi-core Programming
Feb 26, 2009
![Page 2: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/2.jpg)
Hello, in parallel!
![Page 3: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/3.jpg)
Outline
Introduction to CUDA
Programming Basics
Planning for CUDA
![Page 4: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/4.jpg)
Introduction
GPGPU = General-Purpose computing on GPUs
![Page 5: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/5.jpg)
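[Figure: peak floating-point performance, 2003 to 2008. NVIDIA GPUs (G70, G80, GT200 at 933 Gflops) compared with Intel CPUs (3.0 GHz Core2 Duo, 3.2 GHz Harpertown), with a quotation attributed to Gordon Moore.]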
![Page 6: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/6.jpg)
Memory Bandwidth
100 GB/s = 12.5 Gfloat read-writes/s (one 4-byte read plus one 4-byte write per float)
[Figure: memory bandwidth in GB/s (axis 0 to 100) over the years 2003 to 2007.]
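The arithmetic behind the slide is 100 × 10⁹ bytes/s ÷ 8 bytes per read-write pair = 12.5 × 10⁹ pairs/s. A minimal sketch of how one might measure the effective figure on a given card follows; `copy_kernel` and `measure_copy_bandwidth` are hypothetical names chosen for this illustration, and the result depends on the hardware.

```cuda
#include <cuda_runtime.h>

// Each thread copies one float: one 4-byte read plus one 4-byte write.
__global__ void copy_kernel(const float* in, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = in[i];
}

// Returns the effective bandwidth of the copy in GB/s (1 GB = 1e9 bytes),
// or a negative value if the device buffers cannot be allocated.
// Assumes n > 0.
float measure_copy_bandwidth(int n)
{
    float* in = 0;
    float* out = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void**)&in, bytes) != cudaSuccess)
        return -1.0f;
    if (cudaMalloc((void**)&out, bytes) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    unsigned block = 256;
    unsigned grid = (n + block - 1) / block;
    copy_kernel<<<grid, block>>>(in, out, n);   // warm-up launch, not timed

    cudaEventRecord(start, 0);
    copy_kernel<<<grid, block>>>(in, out, n);   // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);

    // 8 bytes move per element; bytes / (ms * 1e6) gives GB/s.
    return (8.0f * n) / (ms * 1.0e6f);
}
```

Calling `measure_copy_bandwidth(1 << 22)` times a 16 MB copy; dividing the returned GB/s by 8 gives Gfloat read-writes per second.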
![Page 7: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/7.jpg)
First, there was just a graphics pipeline
![Page 8: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/8.jpg)
What could they do with it by reading from textures and writing to others?
![Page 9: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/9.jpg)
Stream mapping
OP
![Page 10: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/10.jpg)
Parallel reduction
OP OP
![Page 11: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/11.jpg)
Gather input from textures
OP
![Page 12: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/12.jpg)
Scatter output as vertices
OP
![Page 13: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/13.jpg)
Map Reduce
Gather Scatter
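In CUDA terms, three of these primitives can be sketched as one-thread-per-element kernels. This is an illustration under assumptions, not code from the talk: the kernel names and the host helper `square_then_reverse` are hypothetical, and the scatter kernel assumes the index array contains no duplicates.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Map: out[i] depends only on in[i].
__global__ void map_square(const float* in, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Gather: each thread reads from a computed location and writes its own slot.
__global__ void gather(const float* in, const int* index, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[index[i]];
}

// Scatter: each thread reads its own slot and writes to a computed location.
// Assumes index holds no duplicates; otherwise the writes would race.
__global__ void scatter(const float* in, const int* index, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[index[i]] = in[i];
}

// Hypothetical host helper: squares h_data, then reverses it with a gather.
// Assumes n > 0.
void square_then_reverse(float* h_data, int n)
{
    std::vector<int> h_index(n);
    for (int i = 0; i < n; ++i)
        h_index[i] = n - 1 - i;

    float* d_a = 0;
    float* d_b = 0;
    int* d_index = 0;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMalloc((void**)&d_b, n * sizeof(float));
    cudaMalloc((void**)&d_index, n * sizeof(int));
    cudaMemcpy(d_a, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_index, &h_index[0], n * sizeof(int), cudaMemcpyHostToDevice);

    unsigned block = 256;
    unsigned grid = (n + block - 1) / block;
    map_square<<<grid, block>>>(d_a, d_b, n);        // d_b = d_a squared
    gather<<<grid, block>>>(d_b, d_index, d_a, n);   // d_a = d_b reversed

    cudaMemcpy(h_data, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_index);
}
```

Reduction is the odd one out because threads have to combine their results, which needs cooperation inside a block.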
![Page 14: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/14.jpg)
Back to present…
What to use GPU for?
![Page 15: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/15.jpg)
What to use GPU for?
Physics Simulations
![Page 16: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/16.jpg)
What to use GPU for?
Linear Algebra
Finance
Pattern Recognition…
![Page 17: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/17.jpg)
What to use GPU for?
Biomedical Imaging
![Page 18: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/18.jpg)
Do you have a case where CUDA could be used?
![Page 19: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/19.jpg)
CUDA Architecture
![Page 20: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/20.jpg)
Yet Another...
Single Instruction, Multiple Threads (SIMT)
Warp = 32 threads, lock-step, masked
Excerpt from the CUDA Programming Guide, Version 2.1 (Chapter 2, Programming Model, page 10):

Figure 2-1. Grid of Thread Blocks
[Figure: a grid of 3 × 2 blocks, Block (0, 0) to Block (2, 1); Block (1, 1) is expanded into 4 × 3 threads, Thread (0, 0) to Thread (3, 2).]

2.3 Memory Hierarchy
CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 2-2. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Sections 5.1.2.1, 5.1.2.3, and 5.1.2.4). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Section 4.3.4).
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
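A sketch of how a thread in such a two-dimensional grid can find its own element, assuming a row-major image; `fill_linear_index` and `fill_on_device` are hypothetical names. The block shape of 4 × 3 threads is chosen only to mirror Figure 2-1 and is far too small for real work.

```cuda
#include <cuda_runtime.h>

// Each thread writes its own linear index into a width x height image.
__global__ void fill_linear_index(int* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        out[y * width + x] = y * width + x;
}

// Hypothetical host helper: launches enough blocks to cover the image.
void fill_on_device(int* h_out, int width, int height)
{
    int* d_out = 0;
    size_t bytes = (size_t)width * height * sizeof(int);
    cudaMalloc((void**)&d_out, bytes);

    dim3 block(4, 3);                                // 4 x 3 threads per block
    dim3 grid((width + block.x - 1) / block.x,       // round up in x
              (height + block.y - 1) / block.y);     // round up in y
    fill_linear_index<<<grid, block>>>(d_out, width, height);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

With `width = 12` and `height = 6` the launch uses exactly the 3 × 2 grid of blocks from the figure.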
![Page 21: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/21.jpg)
Abstraction
1) You won’t know which core or when.
2) You don’t care how many cores.
3) Forget synchronization if you can.
![Page 22: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/22.jpg)
GeForce, Quadro or Tesla?
![Page 23: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/23.jpg)
Compute Capability?
1.0 – 1st generation (Nov. 2006)
1.1 – Atomics & asynchronous memory transfers
1.2 – Relaxed alignment requirements, voting intrinsics
1.3 – Double-precision support (at 1/8 of single-precision speed)
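The compute capability of an installed card can be queried at run time through the runtime API. A minimal sketch; the helper names `compute_capability` and `supports_double` are made up for this example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns major * 10 + minor for the given device (13 for capability 1.3),
// or -1 if the device cannot be queried.
int compute_capability(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    std::printf("%s: compute capability %d.%d, %d multiprocessors\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return prop.major * 10 + prop.minor;
}

// Double precision needs compute capability 1.3 or later.
bool supports_double(int device)
{
    return compute_capability(device) >= 13;
}
```

A program can use such a check to fall back to single precision on older cards.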
![Page 24: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/24.jpg)
CUDA code can be compiled to different architectures,
including multi-core CPUs.
PTX is an intermediate language between CUDA and CUBIN
![Page 25: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/25.jpg)
![Page 26: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/26.jpg)
Programming
What’s needed to run GPGPU with CUDA?
![Page 27: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/27.jpg)
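CUDA is close to C or C++

    enum { max_coeff = 128 };               // capacity of the coefficient table
    __constant__ float coeff[max_coeff];    // coefficients live in constant memory

    // Evaluates coeff[0] + coeff[1]*x + ... + coeff[order]*x^order (Horner's rule).
    __device__ float eval(float x, int order) {
        float r = 0;
        for (int i = order; i >= 0; --i)
            r = r * x + coeff[i];
        return r;
    }

    // One thread per element; requires order < max_coeff.
    __global__ void polynomial(float const* x, float* y,
                               int n, int order)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            y[i] = eval(x[i], order);
    }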
![Page 28: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/28.jpg)
But GPU ≠ CPU...
![Page 29: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/29.jpg)
GPU ≠ CPU
GPU isn’t independent. CPU takes the initiative:

    void calculate_polynomial(float const* a, float const* x,
                              float* y, int n, int order)
    {
        float* z = 0;
        cudaMalloc((void**)&z, n * sizeof(float));
        cudaMemcpy(z, x, n * sizeof(float), cudaMemcpyHostToDevice);
        // eval() reads coeff[0..order], so a must hold order + 1 coefficients
        cudaMemcpyToSymbol(coeff, a, (order + 1) * sizeof(float), 0,
                           cudaMemcpyHostToDevice);
        unsigned block = 256;
        unsigned grid = (n + block - 1) / block;       // round up
        polynomial<<<grid, block>>>(z, z, n, order);   // kernel launch, in place
        cudaMemcpy(y, z, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(z);
    }
![Page 30: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/30.jpg)
GPU ≠ CPU
GPU cannot access CPU memory
Data must be explicitly transferred.
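Every one of those explicit transfers can fail, and the runtime reports this only through return codes. The following is a sketch of one way to check them, not the speaker's code; `ok`, `add_one` and `add_one_on_device` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints a message if a runtime call failed.
static bool ok(cudaError_t status, const char* what)
{
    if (status != cudaSuccess)
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
    return status == cudaSuccess;
}

__global__ void add_one(float* data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1.0f;
}

// Copies h_data to the device, adds 1 to each element there, copies it back.
// Returns false if any step failed. Assumes n > 0.
bool add_one_on_device(float* h_data, int n)
{
    float* d_data = 0;
    size_t bytes = n * sizeof(float);
    if (!ok(cudaMalloc((void**)&d_data, bytes), "cudaMalloc"))
        return false;

    bool good = ok(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice),
                   "copy to device");
    if (good) {
        unsigned block = 256;
        unsigned grid = (n + block - 1) / block;
        add_one<<<grid, block>>>(d_data, n);
        // A launch returns nothing; its status is fetched separately.
        good = ok(cudaGetLastError(), "kernel launch")
            && ok(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost),
                  "copy to host");
    }
    cudaFree(d_data);
    return good;
}
```

The final `cudaMemcpy` also waits for the kernel to finish, so no extra synchronization call is needed in this sketch.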
![Page 31: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/31.jpg)
GPU ≠ CPU
On these devices (compute capability 1.x) there’s no...
stack
recursion
function pointers or true function calls (device functions are inlined by default)
dynamic memory allocation (from device)
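In practice this means recursive algorithms have to be rewritten as loops before they can run in a kernel. As an illustration (the function names are invented for this sketch), exponentiation by squaring, which textbooks usually state recursively, becomes a loop over the bits of the exponent:

```cuda
#include <cuda_runtime.h>

// Iterative exponentiation by squaring: returns base^exp.
// The recursive form calls itself on exp / 2; this loop walks the bits instead.
__device__ float ipow(float base, unsigned exp)
{
    float result = 1.0f;
    while (exp != 0) {
        if (exp & 1u)
            result *= base;   // this bit of the exponent is set
        base *= base;         // square for the next bit
        exp >>= 1;
    }
    return result;
}

__global__ void power_kernel(float* data, unsigned exp, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] = ipow(data[i], exp);
}

// Hypothetical host helper: raises every element of h_data to the power exp.
// Assumes n > 0.
void power_on_device(float* h_data, unsigned exp, int n)
{
    float* d_data = 0;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    unsigned block = 256;
    unsigned grid = (n + block - 1) / block;
    power_kernel<<<grid, block>>>(d_data, exp, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```

All working storage is a fixed number of local variables, so no stack and no device-side allocation is required.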
![Page 32: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/32.jpg)
GPU ≠ CPU
thread: runs the kernel with a given thread index
warp: 32 threads in lock-step
block: max. 512 threads with shared memory, block-level synchronization: __syncthreads()
grid: 100s or 1000s of blocks; no synchronization
device: kernel-level synchronization
host: enqueues kernel calls for device
![Page 33: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/33.jpg)
GPU ≠ CPU
There’s hardly any silicon spent on a GPU cache
constant memory: same address for threads
textures: 2D, read-only
shared memory: read-write, within block
Excerpt from the CUDA Programming Guide, Version 2.1 (Chapter 1, Introduction, page 3):

Figure 1-2. The GPU Devotes More Transistors to Data Processing
[Figure: block diagrams of a CPU (Control, Cache and ALUs above DRAM) and a GPU above its DRAM.]

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations – the same program is executed on many data elements in parallel – with high arithmetic intensity – the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

1.2 CUDA™: a General-Purpose Parallel Computing Architecture
In November 2006, NVIDIA introduced CUDA™, a general purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.
CUDA comes with a software environment that allows developers to use C as a high-level programming language. As illustrated by Figure 1-3, other languages or application programming interfaces will be supported in the future, such as FORTRAN, C++, OpenCL, and Direct3D 11 Compute.
![Page 34: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/34.jpg)
NVCC separates, compiles & embeds GPU code

    nvcc my_program.cu -o my_program
![Page 35: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/35.jpg)
host code (C or C++)
GPU functions (cu)
![Page 36: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/36.jpg)
GPU kernels (ptx)
GPU kernels (cubin)
![Page 37: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/37.jpg)
![Page 38: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/38.jpg)
Runtime API vs. Driver API?
![Page 39: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/39.jpg)
Where to feed input from CPU?
[Diagram: CPU RAM feeds the GPU’s global, texture and constant memory, which the KERNEL reads.]
Global: uncached read/write; alignment (coalescing)
Texture: 2D spatial cache, read-only; 1D buffer texturing
Constant: read same address at a time
![Page 40: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/40.jpg)
Where to feed input from CPU?
[Diagram: GPU global memory feeding the KERNEL.]
Global: uncached read/write; alignment (coalescing)

    cudaMalloc(&dptr, n);
    cudaMemcpy(dptr, p, n, cudaMemcpyHostToDevice);
    cudaFree(dptr);
![Page 41: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/41.jpg)
Threads have specialized memory
[Diagram: the KERNEL with global, texture and constant memory.]
registers: thread-private, usually only tens per thread
local memory: thread-private global memory, spilled-over registers
__shared__: between block threads, limited size < 16 KB/MP, 16 banks (half-warp), broadcast function
![Page 42: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/42.jpg)
![Page 43: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/43.jpg)
Planning
Four steps to CUDA performance
![Page 44: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/44.jpg)
Four steps to performance
1. REDESIGN your algorithm
2. RESTRUCTURE data
3. COOPERATE with block threads
4. SQUEEZE the last juice out
![Page 45: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/45.jpg)
1. Redesign
• redesign algorithm for GPU and big data
  most C code won’t copy-paste to CUDA
• maximize parallel execution
  go data parallel, with 10,000s of threads
• find arithmetic intensity (flops / transfers)
  cache less, compute more; 1 global load costs ≈ 100 flops
• don’t leave the MPs (multiprocessors) unemployed
![Page 46: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/46.jpg)
2. Restructure
• strive for coherent (coalesced) global memory accesses
  it’s a matter of 100 GB/s vs. 10 GB/s
• access locality? could textures help?
  1D buffer textures read directly from global memory
• prevent CPU roundtrips
  GPU–CPU: ~5 GB/s; group transfers if possible
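One common way to get coherent accesses is to move from an array of structures to one contiguous array per field. The sketch below is illustrative only: `Particle` is a hypothetical record type, and for brevity the restructuring happens on the host, whereas a real program would keep the data in the new layout throughout.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Particle { float x, y, z, w; };   // hypothetical 16-byte record

// Array of structures: neighbouring threads touch addresses 16 bytes apart,
// so most of each memory transaction carries fields nobody asked for.
__global__ void scale_x_aos(Particle* p, float s, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) p[i].x *= s;
}

// Structure of arrays: neighbouring threads touch neighbouring floats.
__global__ void scale_x_soa(float* x, float s, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] *= s;
}

// Hypothetical host helper: packs the x fields contiguously, scales them on
// the device with the coalesced kernel, and writes them back. Assumes n > 0.
void scale_x_restructured(Particle* h_p, float s, int n)
{
    std::vector<float> h_x(n);
    for (int i = 0; i < n; ++i)
        h_x[i] = h_p[i].x;

    float* d_x = 0;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, &h_x[0], n * sizeof(float), cudaMemcpyHostToDevice);

    unsigned block = 256;
    unsigned grid = (n + block - 1) / block;
    scale_x_soa<<<grid, block>>>(d_x, s, n);

    cudaMemcpy(&h_x[0], d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);

    for (int i = 0; i < n; ++i)
        h_p[i].x = h_x[i];
}
```

`scale_x_aos` is included only for contrast; how large the gap between the two kernels is depends on the card and its compute capability.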
![Page 47: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/47.jpg)
3. Cooperate
• block threads talk through shared memory
  save global memory loads when gathering
• use __syncthreads() if needed
  but go lock-step within warps
• prevent shared memory bank conflicts
  half-warp threads read different banks, or broadcast
• warp voting?
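A block-level sum is a compact example of threads cooperating through shared memory. This is a sketch under assumptions rather than a tuned implementation: `block_size`, `block_sum` and `sum_on_device` are hypothetical names, the block size must be a power of two, and the warp-level lock-step shortcut mentioned above is deliberately left out to keep the code simple.

```cuda
#include <cuda_runtime.h>
#include <vector>

enum { block_size = 256 };   // threads per block; must be a power of two

// Each block sums up to block_size inputs and writes one partial sum.
__global__ void block_sum(const float* in, float* partial, int n)
{
    __shared__ float cache[block_size];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;   // one global load per thread
    __syncthreads();

    // Tree reduction: the number of active threads halves each step.
    // Active threads read consecutive words, so banks do not collide.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host helper: blocks cannot synchronize with each other,
// so the per-block partial sums are added up on the host. Assumes n > 0.
float sum_on_device(const float* h_in, int n)
{
    int blocks = (n + block_size - 1) / block_size;
    float* d_in = 0;
    float* d_partial = 0;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, block_size>>>(d_in, d_partial, n);

    std::vector<float> h_partial(blocks);
    cudaMemcpy(&h_partial[0], d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b)
        total += h_partial[b];
    return total;
}
```

The `__syncthreads()` call sits outside the `if`, so every thread of the block reaches it in every iteration.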
![Page 48: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/48.jpg)
4. Squeeze
• parameterize your application
  auto-tuning algorithms?
• minimize registers and shared memory
• loop unrolling and template tricks
  they help as long as GPU architecture differs from CPU’s
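As one possible reading of "template tricks", the polynomial order can become a template parameter so that the loop bound is known at compile time and the loop can be unrolled. The sketch repeats the coefficient table so that it stands alone; `eval_fixed`, `polynomial_fixed` and `quadratic_on_device` are hypothetical names, and whether unrolling pays off has to be measured.

```cuda
#include <cuda_runtime.h>

enum { max_coeff = 128 };
__constant__ float coeff[max_coeff];

// ORDER is a compile-time constant, so the loop can be fully unrolled.
// Evaluates coeff[0] + coeff[1]*x + ... + coeff[ORDER]*x^ORDER (Horner's rule).
template <int ORDER>
__device__ float eval_fixed(float x)
{
    float r = 0;
#pragma unroll
    for (int i = ORDER; i >= 0; --i)
        r = r * x + coeff[i];
    return r;
}

template <int ORDER>
__global__ void polynomial_fixed(const float* x, float* y, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        y[i] = eval_fixed<ORDER>(x[i]);
}

// Hypothetical host helper: evaluates a[0] + a[1]*x + a[2]*x^2 in place
// for every element of h_x. Assumes n > 0.
void quadratic_on_device(const float* a, float* h_x, int n)
{
    float* d = 0;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeff, a, 3 * sizeof(float));   // order 2: 3 coefficients

    unsigned block = 256;
    unsigned grid = (n + block - 1) / block;
    polynomial_fixed<2><<<grid, block>>>(d, d, n);

    cudaMemcpy(h_x, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

Each order used by the application gets its own instantiation, which trades code size for the removed loop overhead.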
![Page 49: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/49.jpg)
![Page 50: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/50.jpg)
Examples
![Page 51: NVIDIA CUDA - Aalto · 2009. 3. 4. · Seminar on Multi-core Programming Feb 26, 2009. Hello, in parallel! __global__ void funct int tid ... 03 04 __device__ ... recursion function](https://reader034.vdocuments.net/reader034/viewer/2022052012/6028b800de878d4881182dd8/html5/thumbnails/51.jpg)
Questions