introduction to cuda (1 of 2) patrick cozzi university of pennsylvania cis 565 - spring 2012
TRANSCRIPT
![Page 1: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/1.jpg)
Introduction to CUDA (1 of 2)
Patrick CozziUniversity of PennsylvaniaCIS 565 - Spring 2012
![Page 2: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/2.jpg)
Announcements
IBM Almaden
Readings listed on our website Clone your git repository
Image from http://www.almaden.ibm.com/
![Page 3: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/3.jpg)
Acknowledgements
Many slides are from David Kirk and Wen-mei Hwu’s UIUC course:
http://courses.engr.illinois.edu/ece498/al/
![Page 4: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/4.jpg)
Agenda
Parallelism Review GPU Architecture Review CUDA
![Page 5: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/5.jpg)
Parallelism Review
Pipeline ParallelPipelined processorsGraphics pipeline
![Page 6: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/6.jpg)
Parallelism Review
Task ParallelSpell checkerGame enginesVirtual globes
Image from: http://www.gamasutra.com/view/feature/2463/threading_3d_game_engine_basics.php
![Page 7: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/7.jpg)
Parallelism Review
Data ParallelCloth simulationParticle systemMatrix multiply
Image from: https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306
![Page 8: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/8.jpg)
Matrix Multiply Reminder
Vectors Dot products Row major or column major? Dot product per output element
![Page 9: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/9.jpg)
GPU Architecture Review
GPUs are:ParallelMultithreadedMany-core
GPUs have:Tremendous computational horsepowerHigh memory bandwidth
![Page 10: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/10.jpg)
GPU Architecture Review
GPUs are specialized forCompute-intensive, highly parallel computationGraphics!
Transistors are devoted to:ProcessingNot:
Data caching Flow control
![Page 11: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/11.jpg)
GPU Architecture Review
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Transistor Usage
![Page 12: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/12.jpg)
Let’s program this thing!
![Page 13: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/13.jpg)
GPU Computing History
2001/2002 – researchers see GPU as data-parallel coprocessorThe GPGPU field is born
2007 – NVIDIA releases CUDACUDA – Compute Uniform Device ArchitectureGPGPU shifts to GPU Computing
2008 – Khronos releases OpenCL specification
![Page 14: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/14.jpg)
CUDA Abstractions
A hierarchy of thread groups Shared memories Barrier synchronization
![Page 15: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/15.jpg)
CUDA Terminology
Host – typically the CPUCode written in ANSI C
Device – typically the GPU (data-parallel)Code written in extended ANSI C
Host and device have separate memories CUDA Program
Contains both host and device code
![Page 16: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/16.jpg)
CUDA Terminology
Kernel – data-parallel function Invoking a kernel creates lightweight threads
on the device Threads are generated and scheduled with
hardware
Similar to a shader in OpenGL?
![Page 17: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/17.jpg)
CUDA Kernels
Executed N times in parallel by N different CUDA threads
Thread ID
ExecutionConfiguration
DeclarationSpecifier
![Page 18: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/18.jpg)
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 19: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/19.jpg)
Thread Hierarchies
Grid – one or more thread blocks1D or 2D
Block – array of threads1D, 2D, or 3DEach block in a grid has the same number of
threadsEach thread in a block can
Synchronize Access shared memory
![Page 20: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/20.jpg)
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 21: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/21.jpg)
Thread Hierarchies
Block – 1D, 2D, or 3DExample: Index into vector, matrix, volume
![Page 22: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/22.jpg)
Thread Hierarchies
Thread ID: Scalar thread identifier Thread Index: threadIdx
1D: Thread ID == Thread Index 2D with size (Dx, Dy)
Thread ID of index (x, y) == x + y Dy
3D with size (Dx, Dy, Dz)Thread ID of index (x, y, z) == x + y Dy + z Dx Dy
![Page 23: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/23.jpg)
Thread Hierarchies
1 Thread Block 2D Block
2D Index
![Page 24: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/24.jpg)
Thread Hierarchies
Thread BlockGroup of threads
G80 and GT200: Up to 512 threads Fermi: Up to 1024 threads
Reside on same processor coreShare memory of that core
![Page 25: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/25.jpg)
Thread Hierarchies
Thread BlockGroup of threads
G80 and GT200: Up to 512 threads Fermi: Up to 1024 threads
Reside on same processor coreShare memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 26: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/26.jpg)
Thread Hierarchies
Block Index: blockIdx Dimension: blockDim
1D or 2D
![Page 27: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/27.jpg)
Thread Hierarchies
2D Thread Block
16x16Threads per block
![Page 28: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/28.jpg)
Thread Hierarchies
Example: N = 3216x16 threads per block (independent of N)
threadIdx ([0, 15], [0, 15])
2x2 thread blocks in grid blockIdx ([0, 1], [0, 1]) blockDim = 16
i = [0, 1] * 16 + [0, 15]
![Page 29: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/29.jpg)
Thread Hierarchies
Thread blocks execute independently In any order: parallel or seriesScheduled in any order by any number of
cores Allows code to scale with core count
![Page 30: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/30.jpg)
Thread Hierarchies
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
![Page 31: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/31.jpg)
Thread Hierarchies
Threads in a blockShare (limited) low-latency memorySynchronize execution
To coordinate memory accesses __syncThreads()
Barrier – threads in block wait until all threads reach this Lightweight
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 32: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/32.jpg)
CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 33: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/33.jpg)
CUDA Memory Transfers
Host can transfer to/from deviceGlobal memoryConstant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 34: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/34.jpg)
CUDA Memory Transfers
cudaMalloc()Allocate global memory on device
cudaFree()Frees memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 35: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/35.jpg)
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 36: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/36.jpg)
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Pointer to device memory
![Page 37: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/37.jpg)
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Size in bytes
![Page 38: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/38.jpg)
CUDA Memory Transfers
cudaMemcpy()Memory transfer
Host to host Host to device Device to host Device to device
Host Device
Global Memory
Similar to buffer objects in OpenGL
![Page 39: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/39.jpg)
CUDA Memory Transfers
cudaMemcpy()Memory transfer
Host to host Host to device Device to host Device to device
Host Device
Global Memory
![Page 40: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/40.jpg)
CUDA Memory Transfers
cudaMemcpy()Memory transfer
Host to host Host to device Device to host Device to device
Host Device
Global Memory
![Page 41: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/41.jpg)
CUDA Memory Transfers
cudaMemcpy()Memory transfer
Host to host Host to device Device to host Device to device
Host Device
Global Memory
![Page 42: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/42.jpg)
CUDA Memory Transfers
cudaMemcpy()Memory transfer
Host to host Host to device Device to host Device to device
Host Device
Global Memory
![Page 43: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/43.jpg)
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Host Device
Global Memory
Host to device
![Page 44: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/44.jpg)
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Host Device
Global Memory
Source (host)Destination (device)
![Page 45: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/45.jpg)
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Host Device
Global Memory
![Page 46: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/46.jpg)
Matrix Multiply
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
P = M * N Assume M and N are
square for simplicity
![Page 47: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/47.jpg)
Matrix Multiply
1,000 x 1,000 matrix 1,000,000 dot products
Each 1,000 multiples and 1,000 adds
![Page 48: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/48.jpg)
Matrix Multiply: CPU Implementation
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
void MatrixMulOnHost(float* M, float* N, float* P, int width){ for (int i = 0; i < width; ++i) for (int j = 0; j < width; ++j) { float sum = 0; for (int k = 0; k < width; ++k) { float a = M[i * width + k]; float b = N[k * width + j]; sum += a * b; } P[i * width + j] = sum; }}
![Page 49: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/49.jpg)
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 50: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/50.jpg)
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 51: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/51.jpg)
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 52: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/52.jpg)
Matrix Multiply
Step 1Add CUDA memory transfers to the skeleton
![Page 53: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/53.jpg)
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Allocate input
![Page 54: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/54.jpg)
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Allocate output
![Page 55: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/55.jpg)
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
![Page 56: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/56.jpg)
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Read back from device
![Page 57: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/57.jpg)
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Similar to GPGPU with GLSL.
![Page 58: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/58.jpg)
Matrix Multiply
Step 2 Implement the kernel in CUDA C
![Page 59: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/59.jpg)
Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Accessing a matrix, so using a 2D block
![Page 60: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/60.jpg)
Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Each kernel computes one output
![Page 61: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/61.jpg)
Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Where did the two outer for loops in the CPU implementation go?
![Page 62: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/62.jpg)
Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
No locks or synchronization, why?
![Page 63: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/63.jpg)
Matrix Multiply
Step 3 Invoke the kernel in CUDA C
![Page 64: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/64.jpg)
Matrix Multiply: Invoke Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
One block with width by width threads
![Page 65: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/65.jpg)
Matrix Multiply
65
One Block of threads compute matrix Pd Each thread computes one element
of Pd Each thread
Loads a row of matrix Md Loads a column of matrix Nd Perform one multiply and addition
for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
Grid 1
Block 1
3 2 5 4
2
4
2
6
48
Thread(2, 2)
WIDTH
Md Pd
Nd
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
![Page 66: Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649e8a5503460f94b8f905/html5/thumbnails/66.jpg)
Matrix Multiply
What is the major performance problem with our implementation?
What is the major limitation?