Lecture 2: GPU History & CUDA Programming Basics
(Transcript of a PowerPoint presentation.)
CS 193G
Lecture 2: GPU History & CUDA Programming Basics
Outline of CUDA Basics
Basic Kernels and Execution on GPU
Basic Memory Management
Coordinating CPU and GPU Execution
See the Programming Guide for the full API
BASIC KERNELS AND EXECUTION ON GPU
CUDA Programming Model
Parallel code (kernel) is launched and executed on a device by many threads
Launches are hierarchical: threads are grouped into blocks
Blocks are grouped into grids
Familiar serial code is written for a thread; each thread is free to execute a unique code path
Built-in thread and block ID variables
High Level View
(Diagram: the CPU and chipset connect to the GPU and its global memory over PCIe.)
Blocks of threads run on an SM
(Diagram: each streaming processor has registers and per-thread memory; a streaming multiprocessor runs a thread block, whose threads share per-block shared memory.)
Whole grid runs on GPU
(Diagram: many blocks of threads execute across the GPU, all with access to global memory.)
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks
Grid = all blocks for a given launch
Thread block = a group of threads that can:
Synchronize their execution
Communicate via shared memory
Memory Model
(Diagram: per-device global memory persists across sequential kernel launches; Kernel 0 and then Kernel 1 see the same global memory.)
Memory Model
(Diagram: host memory and each device's memory are separate; cudaMemcpy() moves data between them.)
Example: Vector Addition Kernel

```cuda
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}
```

Device code: the __global__ kernel. Host code: the launch in main().
Example: Host code for vecAdd

```cuda
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …, *h_C = …; // h_C initially empty

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
```
Example: Host code for vecAdd (2)

```cuda
// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// do something with the result…

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```
Kernel Variations and Output

```cuda
__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
// Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
// Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
// Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
```
Code executed on GPU
C/C++ with some restrictions:
Can only access GPU memory
No variable number of arguments
No static variables
No recursion
No dynamic polymorphism

Must be declared with a qualifier:
__global__ : launched by CPU, cannot be called from GPU, must return void
__device__ : called from other GPU functions, cannot be called by the CPU
__host__ : can be called by CPU
__host__ and __device__ qualifiers can be combined (sample use: overloading operators)
Memory Spaces
CPU and GPU have separate memory spaces; data is moved across the PCIe bus
Use functions to allocate/set/copy memory on GPU
Very similar to corresponding C functions
Pointers are just addresses: can't tell from the pointer value whether the address is on CPU or GPU
Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc(void** pointer, size_t nbytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)

```cuda
int n = 1024;
int nbytes = n*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree(d_a);
```
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
Returns after the copy is complete
Blocks the CPU thread until all bytes have been copied
Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

Non-blocking copies are also available
Code Walkthrough 1

```cuda
// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
```
Example: Shuffling Data

```cuda
// Reorder values based on keys
// Each thread moves one element
__global__ void shuffle(int* prev_array, int* new_array, int* indices)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    new_array[i] = prev_array[indices[i]];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}
```

Host code: the launch in main().
IDs and Dimensions

Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions are set at launch and can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
(Diagram: a device runs Grid 1, a 3x2 arrangement of blocks; Block (1, 1) is shown as a 5x3 arrangement of threads.)
Kernel with 2D Indexing

```cuda
__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
```
```cuda
int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}
```
Blocks must be independent
Any possible interleaving of blocks should be valid
presumed to run to completion without pre-emption
can run in any order
can run concurrently OR sequentially
Blocks may coordinate but not synchronize:
Shared queue pointer: OK
Shared lock: BAD … can easily deadlock
Independence requirement gives scalability
Questions?
CS 193G
History of GPUs
Graphics in a Nutshell

Make great images:
intricate shapes
complex optical effects
seamless motion

Make them fast:
invent clever techniques
use every trick imaginable
build monster hardware
Eugene d’Eon, David Luebke, Eric EndertonIn Proc. EGSR 2007 and GPU Gems 3
The Graphics Pipeline

Vertex Transform & Lighting
Triangle Setup & Rasterization
Texturing & Pixel Shading
Depth Test & Blending
Framebuffer
The Graphics Pipeline

Key abstraction of real-time graphics
Hardware used to look like this
One chip/board per stage
Fixed data flow through pipeline

(Diagram: fixed pipeline stages: Vertex, Rasterize, Pixel, Test & Blend, Framebuffer.)
The Graphics Pipeline

Everything fixed function, with a certain number of modes
Number of modes for each stage grew over time
Hard to optimize HW
Developers always wanted more flexibility

(Diagram: same fixed pipeline stages.)
The Graphics Pipeline

Remains a key abstraction
Hardware used to look like this
Vertex & pixel processing became programmable, new stages added
GPU architecture increasingly centers around shader execution

(Diagram: same pipeline stages.)
The Graphics Pipeline

Programmability began by exposing a limited instruction set for some stages
At first: limited instructions and instruction types, and no control flow
Later expanded to a full ISA

(Diagram: same pipeline stages.)
Why GPUs scale so nicely

Workload and programming model provide lots of parallelism
Applications provide large groups of vertices at once
Vertices can be processed in parallel
Apply same transform to all vertices
Triangles contain many pixels; pixels from a triangle can be processed in parallel
Apply same shader to all pixels
Very efficient hardware to hide serialization bottlenecks
With Moore’s Law…

(Diagram: the fixed pipeline of Vertex, Raster, Pixel, and Blend stages grows into multiple parallel pixel units (Pixel 0 to 3) and vertex units (Vrtx 0 to 2) sharing the Raster and Blend stages.)
More Efficiency

Note that we do the same thing for lots of pixels/vertices

(Diagram: instead of pairing each ALU with its own control logic, one control unit drives many ALUs.)

A warp = 32 threads launched together
Usually, execute together as well
Early GPGPU

All this performance attracted developers
To use GPUs, they re-expressed their algorithms as graphics computations
Very tedious, limited usability
Still produced some very nice results
This was the lead-up to CUDA
Questions?