cuda_c
DESCRIPTION
CUDA C introduction

TRANSCRIPT

CUDA C
CUDA
A general purpose parallel computing platform and programming model
Introduced in 2006 by NVIDIA. Enhances the compute engine in GPUs to solve complex computational problems in an efficient way.
Comes with a software environment that allows developers to use C as a high-level programming language. Other languages, application programming interfaces, and directives-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC.
CUDA C - A Scalable Programming Model
Mainstream processors are parallel: the advent of many-core and multi-core chips.
3D graphics applications transparently scale their parallelism on GPUs with varying numbers of cores. Challenge: to develop application software that scales transparently with the number of cores.
Let programmers focus on parallel algorithms, not on the mechanics of a parallel programming language.
Faces the challenge using a minimal set of language extensions: a hierarchy of threads, shared memory, and barrier synchronization.
Partition the problem into coarse sub-problems, solved independently in parallel by blocks of threads.
Partition sub-problems into finer pieces, solved cooperatively in parallel by all threads within the block.
Each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially.
A compiled CUDA program can execute on any number of multiprocessors.
Only the runtime system needs to know the physical multiprocessor count.
Automatic Scalability
A multi-threaded program is partitioned into blocks of threads that execute independently of each other, so a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.
Heterogeneous Computing
Host: the CPU and its memory (host memory). Device: the GPU and its memory (device memory).
GPU programming model
Serial code executes in a host (CPU) thread.
Parallel code executes in many concurrent device (GPU) threads across multiple parallel processing elements.
Compiling CUDA C Programs
Refer: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#axzz3Qz0M7rGW
Source files for CUDA applications: a mixture of conventional C++ host code plus GPU device (i.e., GPU-side) functions.
The CUDA compilation trajectory:
Separates the device functions from the host code.
Compiles the device functions using proprietary NVIDIA compilers/assemblers.
Compiles the host code using a general-purpose C/C++ compiler that is available on the host platform, and embeds the compiled GPU functions in the host object file.
In the linking stage, specific CUDA runtime libraries are added to support remote SIMD procedure calling and to provide explicit GPU manipulation, such as allocation of GPU memory buffers and host-GPU data transfer.
Purpose of NVCC
The compilation trajectory involves splitting, compilation, preprocessing, and merging steps for each CUDA source file.
The CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers.
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as those for defining macros and include/library paths, and for steering the compilation process.
All non-CUDA compilation steps are forwarded to a general-purpose C compiler that is supported by nvcc.
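For illustration, a few typical invocations in the style of the examples later in this deck (the file name is hypothetical; -o, -ptx, -I, and -D are standard gcc-style nvcc options):

$ nvcc vecadd.cu -o vecadd                         # compile and link a complete CUDA program
$ nvcc -ptx vecadd.cu                              # emit the PTX generated for the device code
$ nvcc -Iinclude -DUSE_FLOAT vecadd.cu -o vecadd   # include paths and macros, as with gcc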
[Figure: NVCC compilation flow. Integrated C programs with CUDA extensions are split by the NVCC compiler into host code, handled by the host C preprocessor/compiler/linker, and device code (PTX), handled by the device just-in-time compiler, targeting a heterogeneous computing platform with CPUs and GPUs. Source: Kirk and Hwu]
Nvcc's basic workflow separates device code from host code, and then:
Compiles the device code into an assembly form (PTX code) or binary form (cubin object).
Modifies the host code by replacing the kernel-launch syntax with the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code / cubin object.
The modified host code is output as object code by letting nvcc invoke the host compiler during the last compilation stage.
Anatomy of a CUDA C Program
Step 1: Copy input data from host memory to device memory.
Step 2: Launch a kernel on the device.
Step 3: Execute the kernel on the device, caching data on chip for performance.
Step 4: Copy results from device memory back to host memory.
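A minimal sketch of the four steps in code (the kernel, names, and sizes are illustrative, not from the slides; each piece is developed in detail below):

#include <cuda.h>

__global__ void myKernel(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];                                   // Step 3: the kernel executes on the device
}

void run(float* h_in, float* h_out, int n)                   // assumes n is a multiple of 256
{
    int size = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, size);                         // Step 1: allocate device memory...
    cudaMalloc((void**)&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);    // ...and copy input to the device

    myKernel<<<n / 256, 256>>>(d_in, d_out);                 // Step 2: launch a kernel on the device

    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);  // Step 4: copy results back to the host
    cudaFree(d_in); cudaFree(d_out);
}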
Thread
A thread is a simplified view of how a processor executes a program. It consists of: the code of the program; the particular point in the code that is being executed; the values of its variables and data structures.
In CUDA, execution of each thread is sequential, but a CUDA program initiates parallel execution by launching kernel functions. This causes the underlying mechanisms to create many threads that process different parts of the data in parallel.
CUDA Kernels
A kernel is a function that executes the parallel portions of an application on the device; it can be called from the host. One kernel is executed at a time, by many threads.
Kernels execute as an array of threads in parallel: all threads run the same code, and each thread has an ID that it uses to compute memory addresses and make control decisions.
Kernels can only access device memory.
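For instance, a minimal kernel (illustrative, not from the slides) in which every thread uses its ID to select the element it works on:

__global__ void scale(float* data, float factor)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // each thread computes its own index
    data[i] = factor * data[i];                     // all threads run this same line, on different data
}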
Thread Hierarchies
Grid: one or more thread blocks, organized as a 3D array of blocks.
Block: a 3D array of threads. Each block in a grid has the same number of threads (the same dimensions).
Each thread in a block can: synchronize with the other threads of the block; access the block's shared memory.
A kernel is executed as a grid of thread blocks, and all threads share the data memory space.
A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution, for hazard-free shared-memory accesses; efficiently sharing data through a low-latency shared memory.
Two threads from two different blocks cannot cooperate.
Thread block: a group of threads. G80 and GT200: up to 512 threads per block; Fermi: up to 1024 threads.
All threads of a block reside on the same processor core and share the memory of that core.
Threads Representation
Initialize many threads (hundreds or thousands) to wake up a GPU from its bed.
CUDA Thread Organization
Threads are grouped into blocks; blocks are grouped into grids.
Kernels are executed as a grid of blocks of threads.
Only one kernel can execute on a device at one time.
All threads in a grid execute the same kernel function; they rely on unique coordinates to distinguish themselves.
A two-level hierarchy of unique coordinates: blockIdx (the block index), shared by all threads in a block, and threadIdx (the thread index), unique within a block. Used together, they give each thread a unique ID per kernel.
These are built-in, pre-initialized variables accessed within kernels; references to them return the coordinates of the executing thread.
The kernel launch specifies the dimensions of the grid and of each block: gridDim and blockDim.
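A small sketch that makes these coordinate variables visible (illustrative; device-side printf assumes a GPU of compute capability 2.0 or later):

#include <stdio.h>

__global__ void whoami(void)
{
    printf("block (%d,%d) of (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
           threadIdx.x, threadIdx.y);
}

// Launched, e.g., as whoami<<<dim3(3, 2), dim3(5, 3)>>>();
// each of the 3*2 * 5*3 = 90 threads prints its own coordinates.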
Your First CUDA Program

Program 1: Hello World
Standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.
#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello_world.cu
$ ./a.out
Hello World!
$
Program 1: Hello World with Device Code
Source: J. Sanders et al., CUDA by Example.
#include <stdio.h>

__global__ void mykernel(void)
{
}

int main(void)
{
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
Two new syntactic elements: the __global__ keyword and the <<<1,1>>> kernel launch.
The CUDA C/C++ keyword __global__ indicates a function that: runs on the device; is called from host code.
nvcc separates source code into host and device components: device functions (e.g., mykernel()) are processed by the NVIDIA compiler; host functions (e.g., main()) are processed by a standard host compiler (gcc, cl.exe).
__global__ void mykernel(void)
Triple angle brackets mark a call from host code to device code, also called a "kernel launch". We'll return to the parameters (1,1) in a moment.
That's all that is required to execute a function on the GPU.
mykernel<<<1,1>>>();
Program 2: Summing Two Vectors
[Code figure: a conventional CPU loop summing two vectors element by element. Source: J. Sanders et al., CUDA by Example]
The loop can also be written with an explicit tid index, which suggests a potential way to parallelize the code on a system with multiple CPUs or CPU cores.
For example, with a dual-core processor: change the increment to 2, and have one core initialize the loop with tid = 0 and the other with tid = 1. The first core then adds the even-indexed elements and the second core the odd-indexed elements, as sketched below.
Multi-core CPU vector sum
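A sketch of the partition just described (the function name and N are illustrative; cf. the CPU vector-sum code in CUDA by Example):

#define N 10

/* Each core calls add_worker with a different starting index (0 or 1) and strides by 2:
   core 0 adds the even-indexed elements, core 1 the odd-indexed ones. */
void add_worker(const int* a, const int* b, int* c, int first_tid)
{
    int tid = first_tid;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;                  // increment changed from 1 to 2 for two cores
    }
}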
GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    // 1. Allocate device memory for A, B, and C; copy A and B to device memory
    // 2. Kernel launch code: have the device perform the actual vector addition
    // 3. Copy C from device memory to host memory; free the memory allocated on the device
}
N is the number of parallel blocks specified while launching the kernel.
Below, four blocks all run through the same copy of the device code, but each has a different value for the block ID.
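A sketch of the kernel being described (cf. CUDA by Example): launched as add<<<N,1>>>(d_a, d_b, d_c), each of the N single-thread blocks handles one element:

#define N 4

__global__ void add(const int* a, const int* b, int* c)
{
    int tid = blockIdx.x;          // blocks 0, 1, 2, 3 each see a different blockIdx.x
    if (tid < N)
        c[tid] = a[tid] + b[tid];  // the same code runs in every block, on its own element
}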
CUDA Device Memory Model
Global and constant memory: written and read by the host by calling API functions. Constant memory allows short-latency, high-bandwidth, read-only access by the device.
Registers and shared memory (on-chip memories): accessed at high speed, in a parallel manner. Registers are allocated to individual threads; each thread can only access its own registers.
A kernel function uses registers to hold frequently accessed variables that are private to each thread.
Shared memory is allocated to thread blocks; all threads in a block can access its variables. It is an efficient means for threads to cooperate, by sharing input data and intermediate results.
CUDA API functions for managing device global memory
cudaMalloc()
First parameter: the address of a pointer variable, which will point to the allocated object after allocation. The address of the pointer variable should be cast to (void **), because the function expects a generic pointer value; this allows cudaMalloc() to write the address of the allocated object into the pointer variable.
Second parameter: the size of the object to be allocated, in bytes.
float *d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);
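In practice the return value is worth checking: cudaMalloc(), like the other runtime API calls, returns a cudaError_t, and cudaGetErrorString() converts it into a readable message. A sketch:

cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}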
cudaMemcpy()
Cannot be used to copy between different GPUs. Can be used to transfer data in both directions, by proper ordering of the source and destination pointers and use of the appropriate constant for the transfer type.
Four symbolic predefined constants: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    // Kernel invocation code goes here
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
CUDA Extensions to function declaration
                                   Executed on the    Only callable from the
__device__ float DeviceFunc()      Device             Device
__global__ void  KernelFunc()      Device             Host
__host__   float HostFunc()        Host               Host
CUDA Memory Hierarchy
Thread: registers and local memory. Block: shared memory.
Grid (all thread blocks): global memory.
Registers: per thread; data lifetime = thread lifetime.
Local memory: per thread, off-chip; data lifetime = thread lifetime.
Shared memory: per thread block, on-chip; data lifetime = block lifetime.
Global memory: accessible by all threads as well as the host; data lifetime = from allocation to deallocation.
Host memory: not directly accessible by CUDA threads.
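An illustrative kernel (names are hypothetical, and it assumes 256-thread blocks) showing where these memory spaces appear in code:

__constant__ float coeff;                 // constant memory: set by the host, read-only on the device

__global__ void smooth(const float* in, float* out)   // in and out point to global memory
{
    __shared__ float tile[256];           // shared memory: one copy per block, block lifetime
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // i lives in a register, thread lifetime

    tile[threadIdx.x] = in[i];            // stage global data into the block's shared memory
    __syncthreads();                      // barrier so all threads see the staged data

    out[i] = coeff * tile[threadIdx.x];
}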
Thread and Memory Organization
Blocks execute on an SM (streaming multiprocessor).
The grid of blocks is spread across the GPU.
Thread and Block Indexing
Each thread has an ID. Predefined variables allow a thread to access, at runtime, the hardware registers that provide its identifying coordinates.
Threads have 3D IDs, unique within a block. Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
Thread Indexing
A grid consists of N thread blocks, each with a blockIdx.x value that ranges from 0 to N - 1.
Each block consists of M threads, each with a threadIdx.x value that ranges from 0 to M - 1.
Here all blocks at the grid level are organized as a one-dimensional (1D) array, and all threads within each block are also organized as a 1D array, so the grid has a total of N*M threads.
Thread ID = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with 128 (N) blocks and 32 (M) threads in each block has a total of 128*32 = 4096 threads.
Each thread should have a unique ID: thread 3 of block 0 has thread ID 0*32 + 3 = 3; thread 3 of block 5 has thread ID 5*32 + 3 = 163; thread 15 of block 102 has thread ID 102*32 + 15 = 3279.
Example: a grid of dimension <3, 2> has gridDim.x = 3 and gridDim.y = 2; within it, one block has blockIdx.x = 1 and blockIdx.y = 1. A block of dimension <5, 3> has blockDim.x = 5 and blockDim.y = 3; within it, one thread has threadIdx.x = 4 and threadIdx.y = 2.
Vector Addition Kernel Function
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
The if (i < n) statement allows the kernel to process vectors of arbitrary length.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads), and the last 28 threads in block 3 need to be disabled.
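For that example, the corresponding launch would be (a sketch, assuming device pointers d_A, d_B, d_C as before):

vecAddKernel<<<4, 32>>>(d_A, d_B, d_C, 100);   // ceil(100/32.0) = 4 blocks of 32 threads = 128 threads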
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch.
When the host code invokes a kernel, it sets the grid and thread-block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of dim3 type: a C struct with three unsigned integer fields, x, y, and z (the three dimensions).
For 1D and 2D grids/blocks, the unused fields should be set to 1.
A 1D grid with 128 blocks, each consisting of 32 threads; total number of threads = 128 * 32 = 4096:

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses the matching names.
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration.
The compiler takes each expression as the x dimension and assumes y and z are 1:

vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are pre-initialized based on the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
In a grid: blockIdx.x ranges between 0 and gridDim.x - 1; blockIdx.y between 0 and gridDim.y - 1; blockIdx.z between 0 and gridDim.z - 1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided among the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since 32*32*32 = 32,768 exceeds 1024.
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2):
Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
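Within such a 3D block, a thread can flatten its coordinates into a single index; a sketch (the kernel name is illustrative):

__global__ void index3d(int* out)
{
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;   // 4*2*2 = 16 here
    int t = threadIdx.z * blockDim.y * blockDim.x
          + threadIdx.y * blockDim.x
          + threadIdx.x;                                          // position within the block
    int block = blockIdx.y * gridDim.x + blockIdx.x;              // position of the block in the 2D grid
    out[block * threadsPerBlock + t] = t;                         // each block writes its own segment
}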
CUDA execution configuration parameters
E.g., dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( ... );
Kernel Launch Statement
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

Here ceil(n/256.0) is the number of thread blocks, and 256 is the number of threads in each block.
If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks, which launches 4 * 256 = 1024 threads. The number of thread blocks depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks.
Vector Addition Kernel Launch
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
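A hypothetical caller, to show how this host function would be used:

#include <stdlib.h>

int main(void)
{
    int n = 1000;
    float* A = (float*)malloc(n * sizeof(float));
    float* B = (float*)malloc(n * sizeof(float));
    float* C = (float*)malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2.0f * i; }

    vecAdd(A, B, C, n);    // afterwards, C[i] == 3*i for every i

    free(A); free(B); free(C);
    return 0;
}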
082015
CUDA
2
A general purpose parallel computing platform and programming model
Introduced in 2006 by NVIDIAEnhances the compute engine in GPUs to solve complex computation problems in an efficient way
Comes with a software environment that allows developers to use C as a high level programming language other languages application programming interfaces or directives-based approaches are supported such as FORTRAN DirectCompute OpenACC
082015
CUDA C
3
082015
CUDA C - A scalable programming Model
4
Mainstream processors are parallelAdvent of many-core and multi-core chips
3D graphics applications transparently scale their parallelism on GPUs with varying number of coresChallenge To develop application software that scales transparently with the number of cores
Let programmers focus on parallel algorithmsNot on the mechanics of a parallel programming language
0820155
Facing the challenge using minimal set of language extensionsHierarchy of threads Shared MemoryBarrier synchronization
Partition the problem into coarse sub problemsSolved independently in parallel by blocks of threads
Partition sub-problems into finer piecesSolved cooperatively in parallel by all threads within the block
CUDA C - A scalable programming Model
082015
CUDA C - A scalable programming Model
6
Each block of threads can be scheduled on any of the available multiprocessorsWithin a GPU in any order concurrently or sequentially
A compiled CUDA program can execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count
082015
Automatic Scalability
7
A multi threaded program is partitioned into blocks of threads that execute independently from each other so that a GPU with more cores will automatically execute the program in less time than GPU with fewer cores
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA C
3
082015
CUDA C - A scalable programming Model
4
Mainstream processors are parallelAdvent of many-core and multi-core chips
3D graphics applications transparently scale their parallelism on GPUs with varying number of coresChallenge To develop application software that scales transparently with the number of cores
Let programmers focus on parallel algorithmsNot on the mechanics of a parallel programming language
0820155
Facing the challenge using minimal set of language extensionsHierarchy of threads Shared MemoryBarrier synchronization
Partition the problem into coarse sub problemsSolved independently in parallel by blocks of threads
Partition sub-problems into finer piecesSolved cooperatively in parallel by all threads within the block
CUDA C - A scalable programming Model
082015
CUDA C - A scalable programming Model
6
Each block of threads can be scheduled on any of the available multiprocessorsWithin a GPU in any order concurrently or sequentially
A compiled CUDA program can execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count
082015
Automatic Scalability
7
A multi threaded program is partitioned into blocks of threads that execute independently from each other so that a GPU with more cores will automatically execute the program in less time than GPU with fewer cores
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA C - A scalable programming Model
4
Mainstream processors are parallelAdvent of many-core and multi-core chips
3D graphics applications transparently scale their parallelism on GPUs with varying number of coresChallenge To develop application software that scales transparently with the number of cores
Let programmers focus on parallel algorithmsNot on the mechanics of a parallel programming language
0820155
Facing the challenge using minimal set of language extensionsHierarchy of threads Shared MemoryBarrier synchronization
Partition the problem into coarse sub problemsSolved independently in parallel by blocks of threads
Partition sub-problems into finer piecesSolved cooperatively in parallel by all threads within the block
CUDA C - A scalable programming Model
082015
CUDA C - A scalable programming Model
6
Each block of threads can be scheduled on any of the available multiprocessorsWithin a GPU in any order concurrently or sequentially
A compiled CUDA program can execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count
082015
Automatic Scalability
7
A multi threaded program is partitioned into blocks of threads that execute independently from each other so that a GPU with more cores will automatically execute the program in less time than GPU with fewer cores
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
A kernel is executed as a grid of thread blocks; all threads share the data memory space.
A thread block is a batch of threads that can cooperate with each other by synchronizing their execution, for hazard-free shared-memory accesses, and by efficiently sharing data through a low-latency shared memory.
Two threads from two different blocks cannot cooperate. A sketch of this in-block cooperation appears below.
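A minimal sketch, assuming a launch with 256 threads per block (the kernel name is illustrative): each thread stages one element into shared memory, a barrier makes all writes visible, and then each thread reads an element written by a different thread of the same block.

__global__ void reverseInBlock(float* d_out, const float* d_in)
{
    __shared__ float tile[256];                     // shared by all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d_in[i];                    // each thread writes one element
    __syncthreads();                                // barrier: hazard-free shared-memory access
    d_out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read another thread's element
}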
Thread block: a group of threads. On G80 and GT200 GPUs a block may contain up to 512 threads; on Fermi, up to 1024.
The threads of a block reside on the same processor core and share the memory of that core.
Threads Representation
Initialize many threads (hundreds or thousands) to wake a GPU from its bed.
CUDA Thread Organization
Threads are grouped into blocks, and blocks are grouped into grids. Kernels are thus executed as a grid of blocks of threads.
Only one kernel can execute on a device at one time
All threads in a grid execute the same kernel function; they rely on unique coordinates to distinguish themselves.
The coordinates form a two-level hierarchy: blockIdx (the block index), shared by all threads in a block, and threadIdx (the thread index), unique within a block. Used together, they give each thread a unique ID per kernel.
Both are built-in, pre-initialized variables accessed within kernels; references to them return the coordinates of the executing thread.
The kernel launch specifies the dimensions of the grid and of each block, available in the kernel as gridDim and blockDim.
Your First CUDA Program
Program 1: Hello World
This is standard C that runs on the host. The NVIDIA compiler (nvcc) can also be used to compile programs with no device code.
#include <stdio.h>
int main(void) {
    printf("Hello World!\n");
    return 0;
}
Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
Program 1: Hello World with Device Code
Source: J. Sanders et al., CUDA by Example.
__global__ void mykernel(void) { }

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Two new syntactic elements appear here: the __global__ qualifier and the <<<1,1>>> kernel launch, explained next.
The CUDA C/C++ keyword __global__ indicates a function that runs on the device and is called from host code:

__global__ void mykernel(void)

nvcc separates source code into host and device components: device functions (e.g., mykernel()) are processed by the NVIDIA compiler, while host functions (e.g., main()) are processed by the standard host compiler (gcc, cl.exe).
Triple angle brackets mark a call from host code to device code, also called a "kernel launch"; we'll return to the parameters (1,1) in a moment. That's all that is required to execute a function on the GPU:

mykernel<<<1,1>>>();
Program 2: Summing Two Vectors
[Code figure (source: J. Sanders et al., CUDA by Example): a serial CPU implementation of vector addition.]
The code can also be written with an explicit index variable stepped inside a loop.
This suggests a potential way to parallelize the code on a system with multiple CPUs or CPU cores. For example, with a dual-core processor, change the increment to 2 and have one core initialize its loop with tid = 0 and the other with tid = 1; the first core then adds the even-indexed elements and the second core the odd-indexed elements. A sketch follows.
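The original code figure did not survive the transcript; the following sketch is consistent with the description above (N and the function name are assumptions), showing the loop-based form with a per-core start index and stride:

#define N 10

void add(int *a, int *b, int *c, int start, int stride)
{
    int tid = start;              // core 0 starts at 0, core 1 at 1
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += stride;            // stride 2 on a dual-core split
    }
}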
Multi-core CPU vector sum
[Code figure: the dual-core variant, along the lines of the sketch above.]
GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    // 1. Allocate device memory for A, B, and C; copy A and B to device memory.
    // 2. Kernel launch code: have the device perform the actual vector addition.
    // 3. Copy C from device memory to host memory; free the memory allocated on the device.
}
N is the number of parallel blocks specified when launching the kernel. In the (omitted) illustration, four blocks all run through the same copy of the device code but have different values for the block ID; a sketch of such a kernel follows.
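A sketch in the style of CUDA by Example (N and the names are assumptions): the kernel is launched with N blocks of one thread each, and every block uses its block ID to pick the element it handles.

#define N 4

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;         // each block handles the element at its own index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

// host side: add<<<N, 1>>>(dev_a, dev_b, dev_c);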
CUDA Device Memory Model
Global and constant memory can be written and read by the host by calling API functions. Constant memory provides short-latency, high-bandwidth, read-only access by the device.
Registers and shared memory (the on-chip memories) are accessed at high speed in a parallel manner. Registers are allocated to individual threads; each thread can only access its own registers.
A kernel function uses registers to hold frequently accessed variables that are private to each thread.
Shared memory is allocated to thread blocks; all threads in a block can access its variables. It is an efficient means for threads to cooperate by sharing input data and intermediate results.
CUDA API functions for managing device global memory
cudaMalloc()
The first parameter is the address of a pointer variable, which will point to the allocated object after allocation. The address of the pointer variable should be cast to (void **) because the function expects a generic pointer value; this allows cudaMalloc() to write the address of the allocated object into the pointer variable. The second parameter is the size of the object to be allocated, in bytes.

float* d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);
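cudaMalloc(), like most CUDA runtime calls, returns a cudaError_t; a sketch of the usual checking pattern (assuming <stdio.h> and <stdlib.h>; the message format is illustrative):

cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    // cudaGetErrorString() converts the error code into readable text
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}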
cudaMemcpy()
cudaMemcpy() cannot be used to copy between different GPUs. It can transfer data in both directions, given the proper ordering of the source and destination pointers and the appropriate constant for the transfer type.
There are four symbolic predefined constants: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    // Kernel invocation code (shown later)

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
CUDA Extensions to function declaration
Declaration                       Executed on   Only callable from
__device__ float DeviceFunc()     Device        Device
__global__ void KernelFunc()      Device        Host
__host__ float HostFunc()         Host          Host
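An illustrative combination of the qualifiers (hypothetical names): a __device__ helper called from a __global__ kernel, which in turn is launched from host code.

__device__ float square(float x)     // device code, callable only from device code
{
    return x * x;
}

__global__ void squareKernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = square(in[i]);      // device-to-device call
}

// host side: squareKernel<<<ceil(n/256.0), 256>>>(d_out, d_in, n);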
CUDA Memory Hierarchy
Per thread: registers and local memory. Per block: shared memory.
Per grid (all thread blocks): global memory.
Registers: per thread; data lifetime = thread lifetime.
Local memory: per-thread, off-chip memory; data lifetime = thread lifetime.
Shared memory: per thread block, on-chip; data lifetime = block lifetime.
Global memory: accessible by all threads as well as the host; data lifetime = from allocation to deallocation.
Host memory: not directly accessible by CUDA threads.
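A sketch of how declarations commonly map onto these spaces (a summary under typical assumptions; the compiler makes the final placement decisions, and all names here are illustrative):

__device__ float g_scale = 2.0f;        // global memory: visible to all threads (and, via the API, the host)

__global__ void memorySpacesDemo(float* out)   // assumes a launch with 128 threads per block
{
    __shared__ float tile[128];         // shared memory: one copy per block, block lifetime
    float tmp = threadIdx.x * g_scale;  // automatic scalar: normally kept in a register
    tile[threadIdx.x] = tmp;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];  // result to global memory
}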
Thread and Memory Organization
Each block executes on a single streaming multiprocessor (SM).
The grid's blocks are distributed across the GPU's SMs.
Thread and Block Indexing
Each thread has an ID. Predefined variables let a thread read, at runtime, the hardware registers that provide its identifying coordinates.
Thread IDs are 3D and unique within a block. The built-in variables are threadIdx, blockIdx, blockDim, and gridDim.
Thread Indexing
A grid consists of N thread blocks, each with a blockIdx.x value that ranges from 0 to N - 1. Each block consists of M threads, each with a threadIdx.x value that ranges from 0 to M - 1.
When all blocks at the grid level are organized as a one-dimensional (1D) array, and all threads within each block likewise, each grid has a total of N × M threads.
Thread ID = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has a thread ID of 0 × M + 3; thread 3 of block 5 has a thread ID of 5 × M + 3.
E.g., a grid with 128 (N) blocks and 32 (M) threads in each block has a total of 128 × 32 = 4096 threads.
Each thread therefore has a unique ID: thread 3 of block 0 has thread ID 0 × 32 + 3 = 3; thread 3 of block 5 has thread ID 5 × 32 + 3 = 163; thread 15 of block 102 has thread ID 102 × 32 + 15 = 3279.
Example: grid dimension <3, 2>, so gridDim.x = 3 and gridDim.y = 2; block dimension <5, 3>, so blockDim.x = 5 and blockDim.y = 3. For one particular thread in such a launch: blockIdx.x = 1, blockIdx.y = 1, threadIdx.x = 4, threadIdx.y = 2.
Vector Addition Kernel Function
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

The if (i < n) test allows the kernel to process vectors of arbitrary length.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks must be launched (128 threads), and the last 28 threads in block 3 are disabled by the test.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch: when the host code invokes a kernel, it sets the grid and thread-block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks), and the second describes the configuration of the blocks (the groups of threads).
Each parameter is of type dim3, a C struct with three unsigned integer fields: x, y, and z.
For 1D and 2D grids/blocks, the unused fields should be set to 1.
A 1D grid with 128 blocks, each consisting of 32 threads (total number of threads = 128 × 32 = 4096):

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses those names.
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration; the compiler takes each expression as the x dimension and assumes y and z are 1:

vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be renamed. The x fields of the predefined variables gridDim and blockDim are preinitialized from the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
The allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,535.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values. Within a grid, blockIdx.x ranges between 0 and gridDim.x - 1, blockIdx.y between 0 and gridDim.y - 1, and blockIdx.z between 0 and gridDim.z - 1.
The total size of a block is limited to 1024 threads, with the flexibility to distribute them across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since it would amount to 32,768 threads.
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2). Host code:

dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
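In such a launch, a thread can combine its 3D coordinates within the block with the block's 2D coordinates within the grid to form one unique global index; a sketch (the kernel name is illustrative):

__global__ void indexDemo(int* out)
{
    // flatten the 3D thread coordinates within the block
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int tidInBlock = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    // flatten the 2D block coordinates within the grid
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int globalId = blockId * threadsPerBlock + tidInBlock;
    out[globalId] = globalId;   // 2*2 blocks * 16 threads = 64 unique indices here
}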
CUDA execution configuration parameters
E.g., dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( ... );
Kernel Launch Statement
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter, ceil(n/256.0), gives the number of thread blocks; the second, 256, gives the number of threads in each block.
If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks, which launches 4 × 256 = 1024 threads. The number of thread blocks thus depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks.
Vector Addition Kernel Launch
#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
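For completeness, a minimal host program that could drive this routine (the array size and initialization values are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1000;
    float *A = (float*)malloc(n * sizeof(float));
    float *B = (float*)malloc(n * sizeof(float));
    float *C = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    vecAdd(A, B, C, n);                // run the addition on the GPU

    printf("C[42] = %f\n", C[42]);     // expect 126.000000
    free(A); free(B); free(C);
    return 0;
}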
0820155
Facing the challenge using minimal set of language extensionsHierarchy of threads Shared MemoryBarrier synchronization
Partition the problem into coarse sub problemsSolved independently in parallel by blocks of threads
Partition sub-problems into finer piecesSolved cooperatively in parallel by all threads within the block
CUDA C - A scalable programming Model
082015
CUDA C - A scalable programming Model
6
Each block of threads can be scheduled on any of the available multiprocessorsWithin a GPU in any order concurrently or sequentially
A compiled CUDA program can execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count
082015
Automatic Scalability
7
A multi threaded program is partitioned into blocks of threads that execute independently from each other so that a GPU with more cores will automatically execute the program in less time than GPU with fewer cores
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA C - A scalable programming Model
6
Each block of threads can be scheduled on any of the available multiprocessorsWithin a GPU in any order concurrently or sequentially
A compiled CUDA program can execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count
082015
Automatic Scalability
7
A multi threaded program is partitioned into blocks of threads that execute independently from each other so that a GPU with more cores will automatically execute the program in less time than GPU with fewer cores
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Automatic Scalability
7
A multi threaded program is partitioned into blocks of threads that execute independently from each other so that a GPU with more cores will automatically execute the program in less time than GPU with fewer cores
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
• The if (i < n) statement allows the kernel to process vectors of arbitrary length.
• Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks are launched (128 threads). The last 28 threads in block 3 need to be disabled.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch.
When the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of type dim3, a C struct with three unsigned integer fields: x, y, and z (the three dimensions).
Thread Indexing and Organization
For 1D and 2D grids/blocks, the unused dimension fields should be set to 1.
A 1D grid with 128 blocks, each consisting of 32 threads (total number of threads = 128 * 32 = 4096):

    dim3 dimGrid(128, 1, 1);
    dim3 dimBlock(32, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses those names.
Thread Indexing and Organization
If a grid or block has only one dimension, an arithmetic expression can be used instead of dim3 to specify the configuration.
The compiler takes the expression as the x dimension and assumes y and z are 1:

    vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Thread Indexing and Organization
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
In a grid: blockIdx.x ranges between 0 and gridDim.x – 1, blockIdx.y between 0 and gridDim.y – 1, and blockIdx.z between 0 and gridDim.z – 1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowed, but (32, 32, 32) is not, since 32 × 32 × 32 = 32,768 exceeds the 1024-thread limit.
2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2). Host code:

    dim3 dimGrid(2, 2, 1);
    dim3 dimBlock(4, 2, 2);
    Kernel<<<dimGrid, dimBlock>>>(...);
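Inside such a kernel, a unique flat index can be derived from all of the coordinate variables. A sketch under the configuration above (the kernel name and output array are made up):

    __global__ void writeFlatId(int* d_out) {
        int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;    // 4 * 2 * 2 = 16
        int tInBlock = threadIdx.z * blockDim.y * blockDim.x
                     + threadIdx.y * blockDim.x + threadIdx.x;         // 0..15 within the block
        int blockInGrid = blockIdx.y * gridDim.x + blockIdx.x;         // 0..3 in the 2 x 2 grid
        d_out[blockInGrid * threadsPerBlock + tInBlock] = tInBlock;
    }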
CUDA execution configuration parameters
E.g.: dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( ... );
Kernel Launch Statement
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter, ceil(n/256.0), gives the number of thread blocks; the second, 256, gives the number of threads in each block.
• If there are 1000 elements to process, we launch ceil(1000/256.0) = 4 thread blocks.
• That launches 4 * 256 = 1024 threads.
• The number of thread blocks depends on the length of the vectors (n):
• If n = 750: 3 thread blocks.
• If n = 4000: 16 thread blocks.
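A common equivalent host-side idiom (a sketch, not from the slides) computes the block count with integer arithmetic instead of ceil():

    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // integer ceiling: n = 750 -> 3, n = 4000 -> 16
    vecAddKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);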
Vector Addition Kernel Launch
    #include <cuda.h>

    void vecAdd(float* A, float* B, float* C, int n) {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        cudaMalloc((void**)&d_A, size);
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_B, size);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_C, size);

        vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }
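A minimal host-side caller for this function might look as follows (a sketch: the array size and contents are made up, and error checking is omitted):

    int main(void) {
        const int n = 1000;
        float A[1000], B[1000], C[1000];
        for (int i = 0; i < n; ++i) { A[i] = (float)i; B[i] = 2.0f * i; }
        vecAdd(A, B, C, n);            // after the call, C[i] == A[i] + B[i]
        return 0;
    }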
082015
Heterogeneous Computing
8
Host CPU and its memory (host memory)Device GPU and its memory (device memory)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Heterogeneous Computing
9
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
GPU programming model
10
Serial code executes in a HOST(CPU) thread
Parallel code executes in many concurrent DEVICE(GPU) threads across multiple parallel processing elements
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Compiling CUDA C Programs
11
Refer httpdocsnvidiacomcudacuda-compiler-driver-nvccaxzz3Qz0M7rGW
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
• N is the number of parallel blocks specified while launching the kernel.
• Below, four blocks all run through the same copy of the device code, but with different values for the block ID (see the sketch that follows).
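A minimal sketch in the spirit of the book's example (the kernel name and the constant N are assumptions): each of the N blocks contains a single thread, and blockIdx.x selects the element that block works on:

    #define N 4   // number of blocks launched; an assumption for this sketch

    __global__ void add(const int *a, const int *b, int *c)
    {
        int tid = blockIdx.x;          // each block handles one element
        if (tid < N)                   // guard against stray blocks
            c[tid] = a[tid] + b[tid];
    }

    // host side: add<<<N, 1>>>(d_a, d_b, d_c);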
CUDA Device Memory Model
Global and constant memory:
• Written and read by the host by calling API functions
• Constant memory provides short-latency, high-bandwidth, read-only access by the device
Registers and shared memory (on-chip memories):
• Accessed at high speed, in a parallel manner
• Registers are allocated to individual threads; each thread can only access its own registers
A kernel function uses registers to hold frequently accessed variables that are private to each thread.
Shared memory is allocated to thread blocks; all threads in a block can access its variables. It is an efficient means for threads to cooperate by sharing input data and intermediate results.
CUDA API functions for managing device global memory

cudaMalloc()
First parameter: the address of a pointer variable, which will point to the allocated object after allocation. The address should be cast to (void **) because the function expects a generic pointer; this allows cudaMalloc() to write the address of the allocated object into the pointer variable.
Second parameter: the size of the object to be allocated, in bytes.

    float* d_A;
    int size = n * sizeof(float);
    cudaMalloc((void**)&d_A, size);
    …
    cudaFree(d_A);
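The slides omit error handling; as a sketch of common practice (not from the slides), cudaMalloc() returns a cudaError_t that can be checked before the pointer is used:

    cudaError_t err = cudaMalloc((void**)&d_A, size);
    if (err != cudaSuccess) {
        // cudaGetErrorString() turns the error code into a readable message
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

(The fragment assumes <stdio.h> and <stdlib.h> are included.)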
CUDA APIs

cudaMemcpy()
Cannot be used to copy between different GPUs. Can be used to transfer data in both directions, through proper ordering of the source and destination pointers and use of the appropriate constant for the transfer type.
Four symbolic predefined constants: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.

    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
GPU vector sum

    #include <cuda.h>
    …
    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_B, size);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_C, size);

        // Kernel invocation code
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }
CUDA Extensions to Function Declarations

                                      Executed on the:    Only callable from the:
    __device__ float DeviceFunc()     Device              Device
    __global__ void KernelFunc()      Device              Host
    __host__ float HostFunc()         Host                Host
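An illustrative sketch of the three qualifiers in use (the function names are made up); note that __host__ and __device__ may also be combined so that one function is compiled for both sides:

    __device__ float square(float x) { return x * x; }   // device-only helper, callable from kernels

    __global__ void squareAll(float *data, int n)        // kernel: runs on device, launched from host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);
    }

    __host__ __device__ float clampf(float x)            // compiled for both host and device
    {
        return x < 0.0f ? 0.0f : x;
    }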
CUDA Memory Hierarchy
Thread: registers and local memory. Block: shared memory. Grid (all thread blocks): global memory.
• Registers: per thread; data lifetime = thread lifetime
• Local memory: per thread, off-chip; data lifetime = thread lifetime
• Shared memory: per thread block, on-chip; data lifetime = block lifetime
• Global memory: accessible by all threads as well as the host; data lifetime = from allocation to deallocation
• Host memory: not directly accessible by CUDA threads
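A small sketch (kernel and variable names assumed; it presumes blockDim.x <= 128) showing where variables land in this hierarchy inside a kernel:

    __global__ void memorySpaces(float *g_data)    // g_data points into global memory
    {
        int i = threadIdx.x;                       // scalar local: normally held in a register
        __shared__ float s_buf[128];               // shared memory: one copy per block

        s_buf[i] = g_data[blockIdx.x * blockDim.x + i];
        __syncthreads();

        // Large or dynamically indexed per-thread data may spill to off-chip local memory
        g_data[blockIdx.x * blockDim.x + i] = 2.0f * s_buf[i];
    }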
Thread and Memory Organization
Blocks execute on a streaming multiprocessor (SM); the grid of blocks is spread across the GPU.
Thread and Block Indexing
Each thread has an ID. Predefined variables allow a thread to access, at runtime, the hardware registers that provide its identifying coordinates.
Threads have 3D IDs that are unique within a block. Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
Thread Indexing
• A grid consists of N thread blocks, each with a blockIdx.x value ranging from 0 to N – 1.
• Each block consists of M threads, each with a threadIdx.x value ranging from 0 to M – 1.
• All blocks at the grid level are organized as a one-dimensional (1D) array; all threads within each block are also organized as a 1D array.
• Each grid therefore has a total of N*M threads.
Thread Indexing
Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with 128 (N) blocks and 32 (M) threads per block has a total of 128*32 = 4096 threads, and each thread gets a unique ID:
• Thread 3 of block 0: thread ID = 0*32 + 3 = 3
• Thread 3 of block 5: thread ID = 5*32 + 3 = 163
• Thread 15 of block 102: thread ID = 102*32 + 15 = 3279
Thread and Block Indexing
Grid dimension <3, 2>: gridDim.x = 3, gridDim.y = 2
blockIdx.x = 1, blockIdx.y = 1
Block dimension <5, 3>: blockDim.x = 5, blockDim.y = 3
threadIdx.x = 4, threadIdx.y = 2
Vector Addition Kernel Function

    __global__ void vecAddKernel(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

• The if (i < n) statement allows the kernel to process vectors of arbitrary length.
• Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads), and the last 28 threads in block 3 need to be disabled.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch. When the host code invokes a kernel, it sets the grid and thread block dimensions via two execution configuration parameters:
• The first describes the configuration of the grid: the number of blocks.
• The second describes the configuration of the blocks: the number of threads per block.
Each parameter is of type dim3 – a C struct with three unsigned integer fields, x, y, and z (the three dimensions).
Thread Indexing and Organization
For 1D and 2D grids/blocks, the unused fields should be set to 1.
A 1D grid with 128 blocks, each consisting of 32 threads (total number of threads = 128 * 32 = 4096):

    dim3 dimGrid(128, 1, 1);
    dim3 dimBlock(32, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(…);

dimBlock and dimGrid are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses the matching names.
Thread Indexing and Organization
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration. The compiler takes each expression as the x dimension and assumes y and z are 1:

    vecAddKernel<<<ceil(n/256.0), 256>>>(…);

gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Thread Indexing and Organization
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values. In a grid:
• blockIdx.x ranges between 0 and gridDim.x – 1
• blockIdx.y ranges between 0 and gridDim.y – 1
• blockIdx.z ranges between 0 and gridDim.z – 1
The total size of a block is limited to 1024 threads, with flexibility in dividing them across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since 32*32*32 = 32,768 exceeds 1024. (A runtime query sketch follows.)
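These limits vary by hardware generation; as a sketch (not from the slides), the actual limits of the installed device can be queried through the CUDA runtime API:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);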
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2).
Host code:

    dim3 dimGrid(2, 2, 1);
    dim3 dimBlock(4, 2, 2);
    Kernel<<<dimGrid, dimBlock>>>();
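For a configuration like this, a kernel typically flattens the multi-dimensional coordinates into one linear index. A sketch (the kernel body is an assumption, not slide content):

    __global__ void Kernel(void)
    {
        // Global coordinates of this thread within the whole grid
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0..7
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0..3
        int z = threadIdx.z;                             // 0..1 (gridDim.z == 1 here)

        // Linearize with x fastest: for this launch, idx covers 0..63 exactly once
        int width  = gridDim.x * blockDim.x;             // 2 * 4 = 8
        int height = gridDim.y * blockDim.y;             // 2 * 2 = 4
        int idx = z * width * height + y * width + x;    // would index a flat array
        (void)idx;                                       // placeholder: no data touched in this sketch
    }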
CUDA Execution Configuration Parameters
E.g., dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( );
Kernel Launch Statement

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter is the number of thread blocks; the second is the number of threads in each block.
• If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks, i.e., 4 * 256 = 1024 threads.
• The number of thread blocks depends on the length of the vectors (n):
• If n = 750: 3 thread blocks
• If n = 4000: 16 thread blocks
Vector Addition Kernel Launch

    #include <cuda.h>

    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_B, size);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_C, size);

        vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }
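A hypothetical main() (not from the slides) driving this function end to end, assuming the file is compiled as a .cu with nvcc:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1000;
        float *A = (float*)malloc(n * sizeof(float));
        float *B = (float*)malloc(n * sizeof(float));
        float *C = (float*)malloc(n * sizeof(float));

        for (int i = 0; i < n; ++i) { A[i] = (float)i; B[i] = 2.0f * i; }

        vecAdd(A, B, C, n);                              // host wrapper defined above

        printf("C[10] = %.1f (expected 30.0)\n", C[10]); // 10 + 20
        free(A); free(B); free(C);
        return 0;
    }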
082015
Compiling CUDA C Programs
12
Source files for CUDA applications Mixture of conventional C++ host code plus GPU device (ie GPU-)
functions CUDA compilation trajectory
Separates the device functions from the host code Compiles the device functions using proprietary NVIDIA compilers
assemblers Compiles the host code using a general purpose CC++ compiler that is
available on the host platform and Embeds the compiled GPU functions in the host object file
In the linking stage specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Purpose of NVCC
13
Compilation trajectory involves Splitting compilation preprocessing and merging steps for each CUDA source file
CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc it accepts a range of conventional compiler options such as for defining macros and includelibrary paths and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201514
Source Kirk and Hwu
NVCC Compiler
Host C preprocessor compilerlinker
Device just-in time compiler
Heterogeneous Computing Platform with CPUs GPUs
Host Code
Device Code (PTX)
Integrated C programs with CUDA extensions
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201515
Nvccrsquos basic workflow separates device code and host code and thenCompiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing syntax in kernels by necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code cubin object
Modified host code is output as object code by letting nvcc invoke the host compiler during last compilation stage
082015
Anatomy of a CUDA C Program
16
Step 1 Copy input data from host memory to device memory
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
Multi-core CPU vector sum
(Threaded code figure in the original slides.)

GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    // 1. Allocate device memory for A, B and C; copy A and B to device memory.
    // 2. Kernel launch code: have the device perform the actual vector addition.
    // 3. Copy C from device memory to host memory; free the memory allocated on the device.
}
N is the number of parallel blocks specified when launching the kernel.
With N = 4, four blocks all run through the same copy of the device code, each seeing a different value of the block ID, as in the sketch below.
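A minimal sketch in the style of CUDA by Example (N = 4 as above; the remaining names are illustrative): one thread per block, each block selecting its element via blockIdx.x.

#define N 4

__global__ void add(const int *a, const int *b, int *c)
{
    int tid = blockIdx.x;        /* each block handles one element */
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

/* launched from the host as: add<<<N,1>>>(dev_a, dev_b, dev_c); */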
CUDA Device Memory Model
(Overview figure in the original slides.)
Global and constant memory: written and read by the host through API calls. Constant memory additionally gives the device short-latency, high-bandwidth read-only access.
Registers and shared memory (on-chip memories): accessed at high speed in a highly parallel manner. Registers are allocated to individual threads; each thread can only access its own registers.
CUDA Device Memory Model
A kernel function uses registers to hold frequently accessed variables that are private to each thread.
Shared memory is allocated per thread block; all threads in a block can access its variables. It is an efficient means for threads to cooperate by sharing input data and intermediate results.
CUDA API Functions for Managing Device Global Memory

cudaMalloc()
First parameter: the address of a pointer variable that will point to the allocated object after allocation. The address of the pointer variable should be cast to (void**), since the function expects a generic pointer value; this allows cudaMalloc() to write the address of the allocated object into the pointer variable.
Second parameter: the size of the object to be allocated, in bytes.

float* d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);
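Checking the return value is good practice; a minimal sketch of our own (not from the slides), assuming <stdio.h> and <stdlib.h> are included:

cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    /* report the failure (e.g. out of device memory) and abort */
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}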
CUDA APIs

cudaMemcpy()
cudaMemcpy() cannot be used to copy between different GPUs. It can transfer data in either direction, given the proper ordering of the source and destination pointers and the appropriate constant for the transfer type.
Four symbolic predefined constants: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    // Kernel invocation code (shown later)

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
CUDA Extensions to Function Declarations

                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    Device             Device
__global__ void  KernelFunc()    Device             Host
__host__   float HostFunc()      Host               Host
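A minimal sketch combining the three qualifiers (all names are illustrative):

__device__ float square(float x)             /* device-only helper, called from device code */
{
    return x * x;
}

__global__ void squareAll(float *v, int n)   /* kernel: runs on the device, launched from the host */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(v[i]);
}

__host__ void launch(float *d_v, int n)      /* ordinary host function (the default qualifier) */
{
    squareAll<<<(n + 255) / 256, 256>>>(d_v, n);
}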
CUDA Memory Hierarchy
Thread: registers and local memory.
Block: shared memory.
Grid (all thread blocks): global memory.
Registers: per thread; data lifetime = thread lifetime.
Local memory: per thread, off-chip; data lifetime = thread lifetime.
Shared memory: per thread block, on-chip; data lifetime = block lifetime.
Global memory: accessible by all threads as well as the host; data lifetime = from allocation to deallocation.
Host memory: not directly accessible by CUDA threads.
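A minimal sketch of where each space shows up in code (names are illustrative; local memory has no keyword, since the compiler spills registers to it automatically):

__device__ float d_scale = 2.0f;          /* global memory, device scope */

__global__ void scaleAndSum(const float *in, float *out)   /* in/out point to global memory */
{
    __shared__ float tile[128];           /* shared memory: per block, on-chip; assumes <= 128 threads per block */
    int t = threadIdx.x;                  /* t lives in a register: per thread */

    tile[t] = in[blockIdx.x * blockDim.x + t] * d_scale;
    __syncthreads();

    if (t == 0) {                         /* thread 0 sums its block's tile */
        float s = 0.0f;                   /* register variable, thread lifetime */
        for (int k = 0; k < blockDim.x; k++)
            s += tile[k];
        out[blockIdx.x] = s;              /* result back to global memory */
    }
}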
Thread and Memory Organization
Blocks execute on a streaming multiprocessor (SM); the blocks of a grid are distributed across the SMs of the GPU.
Thread and Block Indexing
Each thread has an ID. Predefined variables let a thread read, at runtime, the hardware registers that hold its identifying coordinates.
Thread IDs are 3D and unique within a block. Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
Thread Indexing
Suppose the grid consists of N thread blocks, each with a blockIdx.x value ranging from 0 to N - 1, and each block consists of M threads, each with a threadIdx.x value ranging from 0 to M - 1.
All blocks at the grid level are organized as a one-dimensional (1D) array, and all threads within each block are also organized as a 1D array. The grid then has a total of N * M threads.
Thread Indexing
Thread ID = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with N = 128 blocks and M = 32 threads per block has a total of 128 * 32 = 4096 threads.
Each thread thus gets a unique ID: thread 3 of block 0 has ID 0*32 + 3 = 3; thread 3 of block 5 has ID 5*32 + 3 = 163; thread 15 of block 102 has ID 102*32 + 15 = 3279.
Thread and Block Indexing
Example: a grid dimension of <3, 2> gives gridDim.x = 3 and gridDim.y = 2; the highlighted block has blockIdx.x = 1 and blockIdx.y = 1.
A block dimension of <5, 3> gives blockDim.x = 5 and blockDim.y = 3; the highlighted thread has threadIdx.x = 4 and threadIdx.y = 2.
Vector Addition Kernel Function

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

The if (i < n) test allows the kernel to process vectors of arbitrary length.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks must be launched (128 threads), and the last 28 threads in block 3 must be disabled.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks and a block as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch: when the host code invokes a kernel, it sets the grid and thread block dimensions via two execution configuration parameters. The first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of type dim3, a C struct with three unsigned integer fields: x, y, and z (the three dimensions).
Thread Indexing and Organization
For 1D and 2D grids/blocks, the unused fields should be set to 1.
Example: a 1D grid with 128 blocks, each of 32 threads, for a total of 128 * 32 = 4096 threads.

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimGrid and dimBlock are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses the chosen names.
Thread Indexing and Organization
If a grid or block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration; the compiler takes each expression as the x dimension and assumes y and z are 1.

vecAddKernel<<<ceil(n / 256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be renamed. The x fields of the predefined variables gridDim and blockDim are preinitialized from the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Thread Indexing and Organization
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z. Within a grid, blockIdx.x ranges between 0 and gridDim.x - 1, blockIdx.y between 0 and gridDim.y - 1, and blockIdx.z between 0 and gridDim.z - 1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided across the three dimensions. For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are valid blockDim values, but (32, 32, 32) is not, since 32 * 32 * 32 = 32,768 exceeds 1024.
Example: a 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2).

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
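A minimal sketch (the kernel body is our illustration, not from the slides) of how a kernel launched with this configuration can flatten its 3D coordinates into a single linear index:

__global__ void Kernel(int *out)
{
    /* threads per block = blockDim.x * blockDim.y * blockDim.z = 4 * 2 * 2 = 16 */
    int tInBlock = threadIdx.z * blockDim.y * blockDim.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.x;
    int block    = blockIdx.y * gridDim.x + blockIdx.x;   /* 2D grid: z is unused */
    int global   = block * (blockDim.x * blockDim.y * blockDim.z) + tInBlock;
    out[global] = global;                                 /* each thread writes its flat index */
}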
CUDA Execution Configuration Parameters
E.g. dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>(...);
Kernel Launch Statement

vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter is the number of thread blocks; the second is the number of threads in each block.
If there are 1000 elements, we launch ceil(1000 / 256.0) = 4 thread blocks, i.e. 4 * 256 = 1024 threads.
The number of thread blocks depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks.
Vector Addition Kernel Launch

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
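A minimal host program exercising vecAdd() end to end (the test values are illustrative):

#include <stdio.h>
#include <stdlib.h>

void vecAdd(float* A, float* B, float* C, int n);   /* defined above */

int main(void)
{
    int n = 1000;
    float *A = (float*)malloc(n * sizeof(float));
    float *B = (float*)malloc(n * sizeof(float));
    float *C = (float*)malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2.0f * i; }

    vecAdd(A, B, C, n);   /* allocate, copy in, launch, copy back, free */

    printf("C[10] = %f (expected 30.0)\n", C[10]);
    free(A); free(B); free(C);
    return 0;
}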
Anatomy of a CUDA C Program
Step 1: Copy input data from host memory to device memory.
Step 2: Launch a kernel on the device.
Step 3: Execute the kernel on the device, caching data on chip for performance.
Step 4: Copy results from device memory to host memory.
Thread
A thread is a simplified view of how a processor executes a program. It consists of the code of the program, the particular point in the code that is being executed, and the values of the program's variables and data structures.
In CUDA, the execution of each thread is sequential, but a CUDA program initiates parallel execution by launching kernel functions, which causes the underlying mechanisms to create many threads that process different parts of the data in parallel.
CUDA Kernels
A kernel is a function that executes the parallel portions of an application on the device and can be called from the host. One kernel is executed at a time, by many threads.
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Anatomy of a CUDA C Program
17
Step 2 Launch a kernel on the device
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Anatomy of a CUDA C Program
18
Step 3 Execute the kernel on the device caching data on chip for performance
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Anatomy of a CUDA C Program
19
Step 4 Copy results from device memory to the host memory
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
Thread Indexing
Thread id = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with N = 128 blocks and M = 32 threads in each block has a total of 128*32 = 4096 threads.
Each thread has a unique ID: thread 3 of block 0 has thread ID 0*32 + 3 = 3; thread 3 of block 5 has thread ID 5*32 + 3 = 163; thread 15 of block 102 has thread ID 102*32 + 15 = 3279.
Thread and Block Indexing
Grid dimension <3, 2>: gridDim.x = 3, gridDim.y = 2
blockIdx.x = 1, blockIdx.y = 1
Block dimension <5, 3>: blockDim.x = 5, blockDim.y = 3
threadIdx.x = 4, threadIdx.y = 2
Vector Addition Kernel Function
    __global__ void vecAddKernel(float *A, float *B, float *C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }
• The if (i < n) statement allows the kernel to process vectors of arbitrary lengths.
• Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads). The last 28 threads in block 3 need to be disabled (a compact way to compute this block count is sketched below).
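One way to compute that block count without floating-point arithmetic is the integer-ceiling idiom; this is a common alternative to ceil(), not something the slide itself uses, and the variable names here are mine:

    int threadsPerBlock = 32;                                     // block dimension from the example
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // integer form of ceil(n/32.0)
    vecAddKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);
    // For n = 100: numBlocks = (100 + 31) / 32 = 4 blocks, i.e. 128 threads, as above.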
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch.
When the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
Two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of dim3 type – a C struct with three unsigned integer fields, x, y, and z (the three dimensions).
Thread Indexing and Organization
For 1D and 2D grids/blocks, the unused fields should be set to 1.
A 1D grid with 128 blocks, each of which consists of 32 threads; total number of threads = 128 * 32 = 4096:
    dim3 dimGrid(128, 1, 1);
    dim3 dimBlock(32, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(...);
dimBlock and dimGrid are programmer-defined variables; they can have any names as long as they are of type dim3 and the kernel launch uses the appropriate names.
Thread Indexing and Organization
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration.
The compiler takes each expression as the x dimension and assumes y and z are 1:
    vecAddKernel<<<ceil(n/256.0), 256>>>(...);
gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Thread Indexing and Organization
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
In a grid: blockIdx.x ranges between 0 and gridDim.x – 1; blockIdx.y ranges between 0 and gridDim.y – 1; blockIdx.z ranges between 0 and gridDim.z – 1.
The total size of a block is limited to 1024 threads, with flexibility in dividing them among the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowable, but (32, 32, 32) is not, since 32*32*32 = 32768 exceeds 1024.
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2).
Host code:
    dim3 dimGrid(2, 2, 1);
    dim3 dimBlock(4, 2, 2);
    Kernel<<<dimGrid, dimBlock>>>(...);
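A sketch of how a kernel body could recover 3D global coordinates under this configuration (the body is an illustration; only the launch above is from the slide):

    __global__ void Kernel(void)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // 0..7: 2 blocks * 4 threads
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // 0..3: 2 blocks * 2 threads
        int z = blockIdx.z * blockDim.z + threadIdx.z;  // 0..1: 1 block  * 2 threads
        // 4 blocks * 16 threads per block = 64 threads in total
    }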
CUDA execution configuration parameters
E.g.: dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>(...);
Kernel Launch Statement
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
The first configuration parameter, ceil(n/256.0), gives the number of thread blocks; the second, 256, gives the number of threads in each block.
• If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks.
• That launches 4 * 256 = 1024 threads.
• The number of thread blocks depends on the length of the vectors (n):
• If n = 750: 3 thread blocks.
• If n = 4000: 16 thread blocks.
Vector Addition Kernel Launch
    #include <cuda.h>
    void vecAdd(float *A, float *B, float *C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        cudaMalloc((void **)&d_A, size);
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **)&d_B, size);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **)&d_C, size);

        vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }
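A hypothetical host-side caller, to show how vecAdd() is used end to end (the array contents and the printed element are illustrative assumptions):

    #include <stdio.h>
    int main(void)
    {
        const int n = 1000;
        float A[n], B[n], C[n];
        for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2.0f * i; }
        vecAdd(A, B, C, n);              // allocates, copies, launches the kernel, copies back
        printf("C[10] = %f\n", C[10]);   // expect 30.000000, since 10 + 2*10 = 30
        return 0;
    }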
082015
Thread
20
Simplified view of how a processor executes a programConsists of Code of the programParticular point in the code that is being executedValues of its variables and data structures
In CUDA execution of each thread is sequential But CUDA program initiates parallel execution by launching kernel functionsCauses the underlying mechanisms to create many threads that process different parts of the data in parallel
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Kernels
21
Kernel is function that executes parallel portions of an application on the device and can be called from the hostOne kernel is executed at a time by many threads
Execute as an array of threads in parallelAll threads run same codeEach thread has an ID that is used to compute memory addresses and make control decisions
Can only access device memory
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Hierarchies
22
GridOne or more thread blocksOrganized as a 3D array of blocks
Block3D array of threadsEach block in a grid has the same number of threads (dimension)
Each thread in a block can SynchronizeAccess shared memory
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Hierarchies
23
A kernel is executed as a grid of thread blocksAll threads share data memory space
A thread block is a batch of threads that can cooperate with each other bySynchronizing their executionFor hazard-free shared memory accesses
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Hierarchies
24
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
CUDA Device Memory Model
44
CUDA Device Memory Model
45
Global and constant memory: written and read by the host by calling API functions. Constant memory: short-latency, high-bandwidth, read-only access by the device.
Registers and shared memory (on-chip memories): accessed at high speed in a parallel manner. Registers are allocated to individual threads; each thread can only access its own registers.
CUDA Device Memory Model
46
A kernel function uses registers to hold frequently accessed variables that are private to each thread.
Shared memory is allocated to thread blocks; all threads in a block can access its variables. It is an efficient means for threads to cooperate by sharing input data and intermediate results (see the sketch below).
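An illustrative sketch (not from the slides; names are made up) of a block staging data in shared memory so that all of its threads can reuse it:

// Assumes blockDim.x <= 256.
__global__ void stage(const float* in, float* out, int n)
{
    __shared__ float tile[256];               // per-block, on-chip
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];     // each thread loads one element
    __syncthreads();                          // barrier: wait for the whole block
    if (i < n) out[i] = tile[threadIdx.x];    // now any thread may read the tile
}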
CUDA API functions for managing device global memory
47
cudaMalloc()
48
First parameter: the address of a pointer variable, which will point to the allocated object after allocation. The address should be cast to (void**), since the function expects a generic pointer value; this allows cudaMalloc() to write the address of the allocated object into the pointer variable.
Second parameter: the size of the object to be allocated, in bytes.

float* d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);
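Both calls return a cudaError_t, so a common pattern (a sketch, not shown on the slides) is to check each return value:

cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    // cudaGetErrorString() converts the error code into a readable message
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}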
CUDA APIs
49
cudaMemcpy()
50
cudaMemcpy() cannot be used to copy between different GPUs. It can transfer data in both directions, given proper ordering of the source and destination pointers and the appropriate constant for the transfer type.
Four symbolic predefined constants: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
GPU vector sum
51
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    // Kernel invocation code
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
CUDA Extensions to function declaration
52
                                   Executed on    Only callable from
__device__ float DeviceFunc()      Device         Device
__global__ void KernelFunc()       Device         Host
__host__   float HostFunc()        Host           Host

(An example using all three qualifiers follows below.)
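A short sketch combining the three qualifiers (function names are illustrative):

__device__ float square(float x) { return x * x; }       // device-only helper

__global__ void squareAll(float* data, int n)             // kernel: runs on the
{                                                         // device, launched from host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);                 // calls the __device__ function
}

__host__ float squareOnHost(float x) { return x * x; }    // ordinary host function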
CUDA Memory Hierarchy
53
Per thread: registers and local memory. Per block: shared memory.
CUDA Memory Hierarchy
54
Grid (all thread blocks): global memory.

Registers: per thread; data lifetime = thread lifetime.
Local memory: per-thread, off-chip memory; data lifetime = thread lifetime.
Shared memory: per thread block, on-chip; data lifetime = block lifetime.
Global memory: accessible by all threads as well as the host; data lifetime = from allocation to deallocation.
Host memory: not directly accessible by CUDA threads.

(A declaration-level sketch of these memory spaces follows below.)
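A sketch of how declarations inside a kernel map onto these spaces (an assumption-laden illustration: the names are made up, and the compiler's actual register/local-memory placement can vary):

__global__ void scopes(float* gdata)        // gdata points into global memory
{
    int t = threadIdx.x;                    // scalar: normally held in a register
    __shared__ float tile[128];             // shared memory: per block, on-chip
    float big[64];                          // large per-thread arrays are often
                                            // placed in off-chip local memory
    big[t % 64]   = gdata[t];
    tile[t % 128] = big[t % 64];
    __syncthreads();                        // block-wide barrier
    gdata[t] = tile[t % 128];
}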
Thread and Memory Organization
56
Blocks execute on a streaming multiprocessor (SM).
Thread and Memory Organization
57
Grid of Blocks across GPU
Thread and Block Indexing
58
Each thread has an ID. Predefined variables allow a thread to access, at runtime, the hardware registers that provide its identifying coordinates.
Threads have 3D IDs, unique within a block. Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
Thread Indexing
59
A grid consists of N thread blocks, each with a blockIdx.x value that ranges from 0 to N − 1.
Each block consists of M threads, each with a threadIdx.x value that ranges from 0 to M − 1.
All blocks at the grid level are organized as a one-dimensional (1D) array, and all threads within each block are also organized as a 1D array. Each grid therefore has a total of N×M threads.
Thread Indexing
60
Thread ID = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with 128 (N) blocks and 32 (M) threads per block has a total of 128*32 = 4096 threads.
Each thread should have a unique ID: thread 3 of block 0 has thread ID 0*32 + 3 = 3; thread 3 of block 5 has thread ID 5*32 + 3 = 163; thread 15 of block 102 has thread ID 102*32 + 15 = 3279.
Thread and Block Indexing
61
Grid dimension <3, 2>: gridDim.x = 3, gridDim.y = 2.
blockIdx.x = 1, blockIdx.y = 1.
Block dimension <5, 3>: blockDim.x = 5, blockDim.y = 3.
threadIdx.x = 4, threadIdx.y = 2.
(A 2D-indexing sketch follows below.)
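For such a 2D configuration, a kernel would derive its global row and column as in this sketch (the kernel name and bounds parameters are illustrative):

__global__ void coords2D(int* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x coordinate
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y coordinate
    if (row < height && col < width)
        out[row * width + col] = row * width + col;    // linearized 2D index
}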
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

The if (i < n) statement allows the kernel to process vectors of arbitrary length.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads); the last 28 threads in block 3 need to be disabled.
Thread Indexing and Organization
63
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch.
When the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of type dim3 – a C struct with three unsigned integer fields, x, y and z (the three dimensions).
Thread Indexing and Organization
64
For 1D and 2D grids/blocks, the unused fields should be set to 1.
A 1D grid with 128 blocks, each consisting of 32 threads (total number of threads = 128 * 32 = 4096):

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names as long as they are of type dim3 and the kernel launch uses the matching names.
Thread Indexing and Organization
65
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration.
The compiler takes each expression as the x dimension and assumes y and z are 1:

vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are preinitialized from the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Thread Indexing and Organization
66
Allowed values of gridDim.x, gridDim.y and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y and blockIdx.z values.
In a grid, blockIdx.x ranges between 0 and gridDim.x − 1, blockIdx.y between 0 and gridDim.y − 1, and blockIdx.z between 0 and gridDim.z − 1.
The total size of a block is limited to 1024 threads, with flexibility to divide them across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4) and (32, 16, 2) are allowed, but (32, 32, 32) is not, since it would exceed 1024 threads. (These limits can be queried at runtime; see the sketch below.)
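A minimal sketch querying these limits with the standard cudaGetDeviceProperties() API:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);         // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}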
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2).

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
CUDA execution configuration parameters
68
E.g.:
dim3 dimBlock(5, 3);
dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>(...);
Kernel Launch Statement
69
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter, ceil(n/256.0), gives the number of thread blocks; the second, 256, gives the number of threads in each block.
If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks, i.e. 4 * 256 = 1024 threads in total.
The number of thread blocks depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks.
Vector Addition Kernel Launch
70
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
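A minimal host-side usage sketch for the function above (the array size and values are illustrative):

#include <stdio.h>

int main(void)
{
    const int n = 1000;
    static float A[1000], B[1000], C[1000];
    for (int i = 0; i < n; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    vecAdd(A, B, C, n);                              // device performs the addition

    printf("C[10] = %f (expected 30.0)\n", C[10]);   // 10 + 2*10
    return 0;
}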
08201525
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Hierarchies
27
Thread BlockGroup of threadsG80 and GT200 Up to 512 threadsFermi Up to 1024 threads
Reside on same processor coreShare memory of that core
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Threads Representation
28
Initialize many threads Hundreds or Thousands to wake up a GPU from its Bed
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Thread Organization
29
Threads are grouped into blocks Blocks are grouped into Grids
Kernels are executed as a Grid of Block of Threads
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and constant memory: written and read by the host by calling API functions. Constant memory offers short-latency, high-bandwidth, read-only access by the device.
Registers and shared memory (on-chip memories): accessed at high speed in a parallel manner. Registers are allocated to individual threads; each thread can only access its own registers.
A kernel function uses registers to hold frequently accessed variables that are private to each thread.
Shared memory is allocated to thread blocks; all threads in a block can access its variables. It is an efficient means for threads to cooperate by sharing input data and intermediate results (see the sketch below).
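A minimal sketch of that cooperation pattern, assuming a hypothetical kernel launched with 256 threads per block:

__global__ void blockSum(float *in, float *out, int n)
{
    __shared__ float tile[256];                   // visible to all threads in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // each thread loads one element
    __syncthreads();                              // barrier: wait until the tile is full

    if (threadIdx.x == 0) {                       // thread 0 combines the shared data
        float s = 0.0f;
        for (int j = 0; j < blockDim.x; ++j)
            s += tile[j];
        out[blockIdx.x] = s;                      // one partial sum per block
    }
}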
CUDA API function for managing device global memory
cudaMalloc()
First parameter: the address of a pointer variable, which will point to the allocated object after allocation. The address should be cast to (void**) because the function expects a generic pointer value; this allows cudaMalloc() to write the address of the allocated object into the pointer variable, whatever its type.
Second parameter: the size of the object to be allocated, in bytes.

float *d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);
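cudaMalloc(), like most CUDA API functions, returns a cudaError_t status, so a cautious version of the same allocation checks it; a sketch, assuming the usual C headers (stdio.h, stdlib.h):

cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    // cudaGetErrorString() turns the status code into a readable message
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}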
CUDA APIs
cudaMemcpy()
cudaMemcpy() cannot be used to copy between different GPUs. It can transfer data in both directions, given proper ordering of the source and destination pointers and the appropriate constant for the transfer type: cudaMemcpy(destination, source, size, type).
Four symbolic predefined constants select the transfer type: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
GPU vector sum
#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    // Kernel invocation code

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
CUDA Extensions to function declaration
                                 Executed on the    Only callable from the
__device__ float DeviceFunc()    Device             Device
__global__ void  KernelFunc()    Device             Host
__host__   float HostFunc()      Host               Host
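A short sketch of how the qualifiers combine (the function names are illustrative): a __global__ kernel launched from the host may call a __device__ helper, which is itself callable only from device code:

__device__ float square(float x)        // executed on the device, called from device code
{
    return x * x;
}

__global__ void squareAll(float *data, int n)   // executed on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}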
CUDA Memory Hierarchy
Per thread: registers and local memory. Per block: shared memory.
Grid (all thread blocks): global memory
• Registers: per thread; data lifetime = thread lifetime
• Local memory: per thread, off-chip; data lifetime = thread lifetime
• Shared memory: per thread block, on-chip; data lifetime = block lifetime
• Global memory: accessible by all threads as well as the host; data lifetime = from allocation to deallocation
• Host memory: not directly accessible by CUDA threads
Thread and Memory Organization
Blocks execute on streaming multiprocessors (SMs).
A grid of blocks executes across the GPU.
Thread and Block Indexing
Each thread has an ID. Predefined variables allow a thread to access, at runtime, the hardware registers that provide its identifying coordinates.
Thread IDs are 3D and unique within a block. Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
Thread Indexing
• A grid consists of N thread blocks, each with a blockIdx.x value that ranges from 0 to N - 1.
• Each block consists of M threads, each with a threadIdx.x value that ranges from 0 to M - 1.
• All blocks at the grid level are organized as a one-dimensional (1D) array, and all threads within each block are also organized as a 1D array, so each grid has a total of N*M threads.
Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with 128 (N) blocks and 32 (M) threads in each block has a total of 128*32 = 4096 threads.
Each thread thus gets a unique ID: thread 3 of block 0 has thread ID 0*32 + 3 = 3; thread 3 of block 5 has thread ID 5*32 + 3 = 163; thread 15 of block 102 has thread ID 102*32 + 15 = 3279.
Thread and Block Indexing
Grid dimension <3, 2>: gridDim.x = 3, gridDim.y = 2
blockIdx.x = 1, blockIdx.y = 1
Block dimension <5, 3>: blockDim.x = 5, blockDim.y = 3
threadIdx.x = 4, threadIdx.y = 2
Vector Addition Kernel Function
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
• The if (i < n) statement allows the kernel to process vectors of arbitrary length.
• Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads), and the last 28 threads in block 3 need to be disabled.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and each block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch: when the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of dim3 type, a C struct with three unsigned integer fields, x, y, and z (the three dimensions).
For 1D and 2D grids/blocks, the unused fields should be set to 1.
Example: a 1D grid with 128 blocks, each of which consists of 32 threads; the total number of threads is 128 * 32 = 4096.

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses the same names.
If a grid or block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration; the compiler takes each expression as the x dimension and assumes y and z are 1:

vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
In a grid, blockIdx.x ranges between 0 and gridDim.x - 1, blockIdx.y between 0 and gridDim.y - 1, and blockIdx.z between 0 and gridDim.z - 1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since 32*32*32 = 32,768 exceeds 1024.
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2):

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
CUDA execution configuration parameters
E.g.: dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( ... );
Kernel Launch Statement
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter is the number of thread blocks; the second is the number of threads in each block.
• If there are 1000 data elements, we launch ceil(1000/256.0) = 4 thread blocks, which run 4 * 256 = 1024 threads.
• The number of thread blocks depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks (an integer-arithmetic version of this computation is sketched below).
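The same block count is often written in pure integer arithmetic, avoiding the floating-point ceil(); a sketch:

int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // integer ceiling of n/256
vecAddKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);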
Vector Addition Kernel Launch
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
082015
CUDA Thread Organization
30
Only one kernel can execute on a device at one time
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201531
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Thread Organization
32
All threads in a grid execute the same kernel functionRely on unique coordinates to distinguish themselves
Two-level hierarchy using unique coordinatesblockIdx (for block index) ndash shared by all threads in a block threadIdx (for thread index) ndash unique within a blockUsed together to make a unique ID for each thread per kernel
Built-in pre-initialized variables accessed within kernelsReferences to them return coordinates of the thread when executed
Kernel launch specifies the dimensions of the grid and each blockgridDim and blockDim
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201533
Your First CUDA Program
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Program 1 Hello World
34
Standard C that runs on the hostNVIDIA compiler (nvcc) can be used to compile programs with no device code
int main(void) printf(Hello Worldn) return 0
Output $ nvcc hello_worldcu $ aout Hello World $
082015
Program 1 Hello World with Device Code
35
Source J sanders etal Cuda by example
__global__ void mykernel(void) int main(void) mykernelltltlt11gtgtgt() printf(Hello Worldn) return 0
bull Two new syntactic elements
082015
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
• A grid consists of N thread blocks, each with a blockIdx.x value ranging from 0 to N – 1.
• Each block consists of M threads, each with a threadIdx.x value ranging from 0 to M – 1.
• All blocks at the grid level are organized as a one-dimensional (1D) array, and all threads within each block are organized as a 1D array, so each grid has a total of N*M threads.
Thread ID = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has a thread ID of 0*M + 3; thread 3 of block 5 has a thread ID of 5*M + 3.
E.g., a grid with N = 128 blocks and M = 32 threads in each block has a total of 128*32 = 4096 threads.
Each thread thus has a unique ID: thread 3 of block 0 has thread ID 0*32 + 3 = 3; thread 3 of block 5 has thread ID 5*32 + 3 = 163; thread 15 of block 102 has thread ID 102*32 + 15 = 3279. The sketch below shows the same computation inside a kernel.
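A toy kernel makes the mapping concrete (kernel name illustrative, not from the slides): each thread computes its own global ID with the formula above and records it.

__global__ void writeIds(int *ids, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // e.g. block 5, thread 3 -> 5*32 + 3 = 163
    if (tid < n) ids[tid] = tid;
}

// Launched to match the example above:
// writeIds<<<128, 32>>>(d_ids, 4096);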
Thread and Block Indexing
Grid dimension <3, 2>: gridDim.x = 3, gridDim.y = 2.
Block dimension <5, 3>: blockDim.x = 5, blockDim.y = 3.
For example, one particular thread has blockIdx.x = 1, blockIdx.y = 1 and threadIdx.x = 4, threadIdx.y = 2.
Vector Addition Kernel Function
__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

• The if (i < n) statement allows the kernel to process vectors of arbitrary lengths.
• Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks must be launched (128 threads), and the last 28 threads in block 3 need to be disabled.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and a block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch: when the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of type dim3 – a C struct with three unsigned integer fields, x, y, and z (the three dimensions).
For 1D and 2D grids/blocks, the unused fields should be set to 1.
Example: a 1D grid with 128 blocks, each of which consists of 32 threads, for a total of 128 * 32 = 4096 threads:

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names as long as they are of type dim3 and the kernel launch uses the matching names.
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 variables to specify the execution configuration. The compiler takes each expression as the x dimension and assumes y and z are 1:

vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be renamed. The x fields of the predefined variables gridDim and blockDim are preinitialized from the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
Within a grid, blockIdx.x ranges between 0 and gridDim.x – 1, blockIdx.y between 0 and gridDim.y – 1, and blockIdx.z between 0 and gridDim.z – 1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since 32*32*32 = 32,768 exceeds the 1024-thread limit. These limits can also be queried at runtime, as sketched below.
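Rather than hard-coding the limits, a program can query them with cudaGetDeviceProperties(); a minimal sketch for device 0:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // fills in properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}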
Example: a 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2).

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
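For this configuration, each of the 64 threads can derive its global coordinates from the built-in variables; a sketch (kernel body illustrative; device-side printf assumes compute capability 2.0 or later):

#include <stdio.h>

__global__ void Kernel(void)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0..7 (2 blocks * 4 threads)
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0..3 (2 blocks * 2 threads)
    int z = threadIdx.z;                             // 0..1 (gridDim.z == 1)
    printf("thread (%d, %d, %d)\n", x, y, z);        // 64 lines in total
}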
CUDA execution configuration parameters
E.g.: dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( ... );
Kernel Launch Statement
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first execution configuration parameter gives the number of thread blocks; the second gives the number of threads in each block.
• If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks, which launches 4 * 256 = 1024 threads.
• The number of thread blocks depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks.
An equivalent integer computation of the block count is sketched below.
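The same block count can be computed with integer arithmetic, avoiding the floating-point ceil() call:

int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // integer ceiling of n/256
vecAddKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);
// n = 750  -> numBlocks = 3
// n = 1000 -> numBlocks = 4
// n = 4000 -> numBlocks = 16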
Vector Addition Kernel Launch
#include <cuda.h>

void vecAdd(float *A, float *B, float *C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
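A hedged usage sketch for this wrapper (array size and values illustrative): the host fills two arrays, calls vecAdd(), and checks one element of the result.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1000;
    float *A = (float *)malloc(n * sizeof(float));
    float *B = (float *)malloc(n * sizeof(float));
    float *C = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    vecAdd(A, B, C, n);                    // host wrapper defined above

    printf("C[10] = %f (expected 30.0)\n", C[10]);
    free(A); free(B); free(C);
    return 0;
}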
Program 1 Hello World with Device Code
36
CUDA CC++ keyword __global__ indicates a function that Runs on the device Is called from host code
nvcc separates source code into host and device components Device functions (eg mykernel()) processed by NVIDIA compiler
Host functions (eg main()) processed by standard host compiler gcc clexe
__global__ void mykernel(void)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Program 1 Hello World with Device Code
37
Triple angle brackets mark a call from host code to device code Also called a ldquokernel launchrdquo Wersquoll return to the parameters (11) in a moment
Thatrsquos all that is required to execute a function on the GPU
mykernelltltlt11gtgtgt()
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Program 2 Summing two vectors
38
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201539
Source J sanders etal Cuda by example
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201540
Can also be written in this way
bull Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores
bull For example with a dual-core processor bull Can change the increment to 2bull Have one core initialize the loop with tid = 0 and another with tid = 1 bull The first core would add the even-indexed elements and the second core would add the odd-indexed elements
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
Thread ID = blockIdx.x * blockDim.x + threadIdx.x. Thread 3 of block 0 has thread ID 0 * M + 3; thread 3 of block 5 has thread ID 5 * M + 3.
E.g., a grid with 128 (N) blocks and 32 (M) threads in each block has a total of 128 * 32 = 4096 threads.
Each thread has a unique ID: thread 3 of block 0 has thread ID 0 * 32 + 3 = 3; thread 3 of block 5 has thread ID 5 * 32 + 3 = 163; thread 15 of block 102 has thread ID 102 * 32 + 15 = 3279.
Thread and Block Indexing
Grid dimension (3, 2): gridDim.x = 3, gridDim.y = 2.
Highlighted block: blockIdx.x = 1, blockIdx.y = 1.
Block dimension (5, 3): blockDim.x = 5, blockDim.y = 3.
Highlighted thread: threadIdx.x = 4, threadIdx.y = 2.
Vector Addition Kernel Function
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

The if (i < n) test allows the kernel to process vectors of arbitrary length.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks must be launched (128 threads), and the last 28 threads in block 3 are disabled by the test.
Thread Indexing and Organization
In general, a grid is organized as a 3D array of blocks, and each block is organized as a 3D array of threads.
The exact organization of a grid is determined by the execution configuration provided at kernel launch: when the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of each block (the group of threads).
Each parameter is of type dim3, a C struct with three unsigned integer fields x, y, and z (the three dimensions).
For 1D and 2D grids/blocks, the unused fields should be set to 1.
Example: a 1D grid with 128 blocks, each consisting of 32 threads, for a total of 128 * 32 = 4096 threads.

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names as long as they are of type dim3 and the kernel launch uses the matching names.
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration; the compiler takes each expression as the x dimension and assumes y and z are 1.

vecAddKernel<<<ceil(n / 256.0), 256>>>(...);

gridDim and blockDim, by contrast, are part of the CUDA C specification and cannot be renamed. The x fields of the predefined variables gridDim and blockDim are preinitialized from the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.
Within a grid, blockIdx.x ranges between 0 and gridDim.x - 1, blockIdx.y between 0 and gridDim.y - 1, and blockIdx.z between 0 and gridDim.z - 1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided across the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since 32 * 32 * 32 = 32,768 exceeds 1024.
2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2)

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>(...);
CUDA execution configuration parameters
E.g.: dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>(...);
Kernel Launch Statement
vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);

The first configuration parameter is the number of thread blocks; the second is the number of threads in each block.
If there are 1000 elements, we launch ceil(1000 / 256.0) = 4 thread blocks, which is 4 * 256 = 1024 threads.
The number of thread blocks depends on the length of the vectors (n): for n = 750, 3 thread blocks; for n = 4000, 16 thread blocks.
Vector Addition Kernel Launch
#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAddKernel<<<ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
082015
Multi-core CPU vector sum
41
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
GPU vector sum
42
includeltcudahgthellipvoid vecAdd(float A float B float C int n)
1 Allocate device memory for A B and C Copy A and B to device memory
2 Kernel launch code ndash to have the device perform the actual vector addition
3 Copy C from the device memory to the host memory Free memory allocated on the device
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201543
bull N is the number of parallel blocks specified while launching the kernel
bull Below we can see four blocks all running through the same copy of the device code but having different values for the block id
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Device Memory Model
44
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Device Memory Model
45
Global and Constant MemoryWritten and read by host by calling API functionsConstant Memory short-latency high-bandwidth read-only access by the device$
Register and shared memory (on-chip memories)Accessed at high speed in a parallel mannerRegistersAllocated to individual threadsEach thread can only access its own registers
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Device Memory Model
46
Kernel function uses registers to hold frequently accessed variables private to each thread
Shared memory is allocated to thread blocksAll threads in a block can access the variables Efficient means for threads to cooperate by sharing input data and intermediate results
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA API function for managing device global memory
47
082015
cudaMalloc()
48
First parameter address of a pointer variable that must point to the allocated object after allocationThe address of the pointer variable should be cast to (void ) Function expects a generic pointer value Allows the cudaMalloc() function to write the address of the allocated object into the pointer variable
The second parameter size of the object to be allocated in terms of bytes
float d_Aint size = n sizeof(float)cudaMalloc((void)ampd_A size)hellipcudaFree(d_A)
082015
CUDA APIs
49
082015
cudaMemcpy()
50
Cannot be used to copy between different GPUsCan be used to transfer data in both directions by Proper ordering of source and destination pointers and Using the appropriate constant for transfer type
Four symbolic predefined constants cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
cudaMemcpy(d_A A size cudaMemcpyHostToDevice)cudaMemcpy(d_C C size cudaMemcpyDeviceToHost)
082015
GPU vector sum
51
includeltcudahgthellipvoid vecAdd(float A float B float C int n) int size = n sizeof(float) float d_A d_B d_C$ cudaMalloc((void) ampd_A size) cudaMemcpy(d_A A size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_B size) cudaMemcpy(d_B B size cudaMemcpyHostToDevice) cudaMalloc((void) ampd_C size)
Kernel invocation code cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
Free device memory cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA Extensions to function declaration
52
Executed on the
Only callable from the
__device__ float DeviceFunc() Device Device
__global__ void KernelFunc() Device Host
__host__ float HostFunc() Host Host
082015
CUDA Memory Hierarchy
53
Thread Registers and Local Memory Blocks Shared Memory
082015
CUDA Memory Hierarchy
54
Grid(All thread blocks) Global Memory
bull Registersbull Per thread Data lifetime = thread lifetime
bull Local Memorybull Per thread off-chip memorybull Data lifetime = thread lifetime
bull Shared Memorybull Per thread block on-chipbull Data lifetime = block lifetime
bull Global Memorybull Accessible by all threads as well as hostbull Data lifetime = from allocation to deallocation
bull Host Memorybull Not directly accessible by CUDA threads
08201555
082015
Thread and Memory Organization
56
Blocks execute on SM
082015
Thread and Memory Organization
57
Grid of Blocks across GPU
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Thread 3 of Block 0 has a thread ID of 0*M + 3; Thread 3 of Block 5 has a thread ID of 5*M + 3.
E.g., for a grid with 128 (N) blocks and 32 (M) threads in each block, there are 128*32 = 4096 threads in the grid.
Each thread should have a unique ID: Thread 3 of Block 0 has thread ID 0*32 + 3 = 3; Thread 3 of Block 5 has thread ID 5*32 + 3 = 163; Thread 15 of Block 102 has thread ID 102*32 + 15 = 3279.
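The same arithmetic as it appears inside a kernel (a sketch; the kernel name and output array are illustrative):

    __global__ void writeGlobalIds(int* ids)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        ids[tid] = tid;   // e.g., thread 3 of block 5 writes ids[163] = 163
    }

    // Launched with 128 blocks of 32 threads: writeGlobalIds<<<128, 32>>>(d_ids);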
Thread and Block Indexing

Grid dimension <3, 2>: gridDim.x = 3, gridDim.y = 2. For an example block in this grid: blockIdx.x = 1, blockIdx.y = 1.
Block dimension <5, 3>: blockDim.x = 5, blockDim.y = 3. For an example thread in that block: threadIdx.x = 4, threadIdx.y = 2.
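In two dimensions, the global coordinates combine block and thread indices the same way. A sketch under the <3, 2> grid / <5, 3> block configuration above (the kernel name is illustrative):

    __global__ void globalCoords2D(void)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // ranges over 0 .. 3*5-1 = 14
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // ranges over 0 .. 2*3-1 = 5

        // The example thread above (blockIdx = (1,1), threadIdx = (4,2)) gets
        // col = 1*5 + 4 = 9 and row = 1*3 + 2 = 5.
        (void)col; (void)row;   // suppress unused-variable warnings in this sketch
    }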
Vector Addition Kernel Function

    __global__ void vecAddKernel(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

• The if (i < n) statement allows the kernel to process vectors of arbitrary lengths.
• Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads), and the last 28 threads in block 3 need to be disabled.
Thread Indexing and Organization

In general, a grid is organized as a 3D array of blocks, and each block is organized as a 3D array of threads. The exact organization of a grid is determined by the execution configuration provided at kernel launch: when the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters.
There are two parameters: the first describes the configuration of the grid (the number of blocks); the second describes the configuration of the blocks (groups of threads).
Each parameter is of type dim3, a C struct with three unsigned integer fields: x, y, and z (the three dimensions).
For 1D and 2D grids/blocks, the unused fields should be set to 1. For example, a 1D grid with 128 blocks, each consisting of 32 threads (total number of threads = 128*32 = 4096):

    dim3 dimGrid(128, 1, 1);
    dim3 dimBlock(32, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(...);

dimBlock and dimGrid are programmer-defined variables; they can have any names, as long as they are of type dim3 and the kernel launch uses those names.
If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration; the compiler takes each expression as the x dimension and assumes y and z are 1:

    vecAddKernel<<<ceil(n/256.0), 256>>>(...);

gridDim and blockDim are part of the CUDA C specification and cannot be changed. The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters: if n is 4000, then gridDim.x = 16 and blockDim.x = 256.
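An equivalent, purely integer way to compute the block count, often used in place of the floating-point ceil() (a stylistic alternative, not from the slides):

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // integer ceil(n/256)
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
    // For n = 4000: (4000 + 255) / 256 = 16 blocks, matching ceil(4000/256.0)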
Allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65536.
All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values. In a grid, blockIdx.x ranges between 0 and gridDim.x-1, blockIdx.y between 0 and gridDim.y-1, and blockIdx.z between 0 and gridDim.z-1.
The total size of a block is limited to 1024 threads, with flexibility in how they are divided among the three dimensions. Example blockDim values: (512, 1, 1), (8, 16, 4), and (32, 16, 2) are allowed, but (32, 32, 32) is not, since 32*32*32 = 32768 exceeds the 1024-thread limit.
A 2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2). Host code:

    dim3 dimGrid(2, 2, 1);
    dim3 dimBlock(4, 2, 2);
    Kernel<<<dimGrid, dimBlock>>>(...);
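A sketch of a kernel that linearizes these 3D coordinates into one global index (the kernel name and flattening order are illustrative assumptions):

    __global__ void flatten3D(int* out)
    {
        int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;   // 4*2*2 = 16 here
        int blockId  = blockIdx.y * gridDim.x + blockIdx.x;           // 0..3 for the (2,2,1) grid
        int threadId = threadIdx.z * (blockDim.x * blockDim.y)
                     + threadIdx.y * blockDim.x
                     + threadIdx.x;                                   // 0..15 within the block

        out[blockId * threadsPerBlock + threadId] = threadId;
    }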
CUDA execution configuration parameters

E.g., dim3 dimBlock(5, 3); dim3 dimGrid(3, 2);
Kernel call, e.g.: gauss<<<dimGrid, dimBlock>>>( ... );
Kernel Launch Statement

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

The first execution configuration parameter, ceil(n/256.0), gives the number of thread blocks; the second, 256, gives the number of threads in each block.
• If there are 1000 threads, we launch ceil(1000/256.0) = 4 thread blocks, which launches 4*256 = 1024 threads.
• The number of thread blocks depends on the length of the vectors (n): if n = 750, 3 thread blocks; if n = 4000, 16 thread blocks.
Vector Addition Kernel Launch

    #include <cuda.h>

    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        cudaMalloc((void**)&d_A, size);
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_B, size);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_C, size);

        vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }
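As a usage sketch (not from the slides): a minimal host program that allocates and initializes the host arrays, calls vecAdd(), and spot-checks one element. The vector length and initialization values are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1000;   // illustrative vector length
        float *A = (float*)malloc(n * sizeof(float));
        float *B = (float*)malloc(n * sizeof(float));
        float *C = (float*)malloc(n * sizeof(float));

        for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2.0f * i; }

        vecAdd(A, B, C, n);   // the host function defined above

        printf("C[10] = %f (expected 30.0)\n", C[10]);

        free(A); free(B); free(C);
        return 0;
    }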
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread and Block Indexing
58
Each Thread has an IDPredefined variables that allow a thread to access the hardware registers at runtime that provide the identifying coordinates to the thread
Threads 3D IDs and unique within a BlockBuilt-in variables threadIdx blockIdx blockDim gridDim
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Indexing
59
bull Grid consists of N thread blocksbullEach with a blockIdxx value that ranges from 0 to N ndash 1
bull Each block consists of M threadsbullEach with a threadIdxx value that ranges from 0 to M ndash 1
bull All blocks at the grid level are organized as a one-dimensional (1D) arraybull All threads within each block are also organized as a 1D array bull Each grid has a total of NM threads
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Indexing
60
Thread id= blockIdxx blockDimx + threadIdxxThread 3 of Block 0 has a threadID value of 0M +3Thread 3 of Block 5 has a threadID value of 5M+3
Eg A grid with 128(N) blocks and 32(M) threads in each blockTotal of 12832=4096 threads in a grid
Each thread should have a unique ID Thread 3 of Block 0 has a threadID value of 032 +3Thread 3 of Block 5 has a threadID value of 532+3=163Thread 15 of block 102 threadID =
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread and Block Indexing
61
Grid Dimension lt3 2gtgridDimx=3gridDimy=2
blockIdxx=1blockIdxy=1
Block Dimension lt5 3gtblockDimx=5blockDimy=3
threadIdxx=4threadIdxy=2
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Vector Addition Kernel Function
62
__global__ void vecAddKernel(float A float B float C int n) int i = threadIdxx + blockDimx blockIdxx if(i lt n) C[i] = A[i] + b[i]
bull The if(iltn) statement allows the kernel to process vectors of arbitrary lengths
bull Example If vector length is 100 and thread block dimension is 32 then four thread blocks need to be launched (128 threads) Last 28 threads in block 3 need to be disabled
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Indexing and Organization
63
In generalGrid is organized as a 3D array of blocksBlock is organized as 3D array of threads
Exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel it sets the grid and thread block dimensions via execution configuration parameters
2 parameters One describes the configuration of the grid Number of blocksSecond one describes the configuration of blocks groups of threads
Each parameter is dim3 type ndash C struct type with 3 unsigned integer fields x y and z (three dimensions)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Indexing and Organization
64
For 1D and 2D gridsblocks the unused fields should be set to 1
1D grid with 128 blocks each of which consists of 32 threadsTotal number of threads = 128 32 = 4096$ dim3 dimGrid(128 1 1)$ dim3 dimBlock(32 1 1)$ vecAddKernelltltltdimGrid dimBlockgtgtgt(hellip)dimBlock and dimGrid are programmer defined variablesThese variables can have any names as long as they are of type dim3 and kernel launch uses appropriate names
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Indexing and Organization
65
If a Grid Block has only 1 Dimension instead of dim3 use arithmetic expressions to specify the configuration
Compiler takes the expression as x dimensions and assumes y and z are 1$ vecAddKernelltltltceil(n2560) 256gtgtgt(hellip) gridDim and blockDim are part of CUDA C specification and
cannot be changed The x field of the predefined variables gridDim and blockDim get
preinitialized based on the execution configuration parameters If n is 4000 then gridDimx = 16 and blockDimx = 256
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Thread Indexing and Organization
66
Allowed values of gridDimx gridDimy and gridDimz range from 1 to 65536
All threads in a block share same blockIdxx blockIdxy and blockIdxz
In a gridThe blockIdxx ranges between 0 and gridDimx ndash 1The blockIdxy ranges between 0 and gridDimy ndash 1 The blockIdxz ranges between 0 and gridDimz ndash 1
Total size of a block is limited to 1024 threadsFlexibility to divide into 3 dimensionsExample blockDim values
(512 1 1) (8 16 4) and (32 16 2) (32 32 32)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
08201567
2D grid (2 2 1) consisting of 3D blocks (4 2 2)
Host codedim3 dimGrid(2 2 1)dim3 dimBlock(4 2 2)KernelltltltdimGriddimBlockgtgtgt()
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
CUDA execution configuration parameters
68
$ Eg dim3 dimBlock(53) dim3 dimGrid(32)Kernel call Eg gaussltltltdimGriddimBlockgtgtgt( )
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Kernel Launch Statement
69
vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)
Number of thread blocks
Number of threads in each block
bull If there are 1000 threads we launch ceil(10002560) = 4 thread blocks
bull It will launch 4 256 = 1024 threadsbull Number of thread blocks depends on the length of the
vectors (n)bull If n = 750bull 3 thread blocks
bull If n = 4000 bull 16 thread blocks
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)
082015
Vector Addition Kernel Launch
70
includeltcudahgtvoid vecAdd(float A float B float C int n) int size = n sizeof(float)$ float d_A d_B d_C$ cudaMalloc((void) ampd_A size)$ cudaMemcpy(d_A A size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_B size)$ cudaMemcpy(d_B B size cudaMemcpyHostToDevice)$ cudaMalloc((void) ampd_C size)
$ vecAddKernelltltltceil(n2560) 256gtgtgt(d_A d_B d_C n)$ cudaMemcpy(C d_C size cudaMemcpyDeviceToHost)
$ Free device memory$ cudaFree(d_A) cudaFree(d_B) cudaFree(d_C)