Parallel Computing on Graphics Processors


Page 1: Parallel Computing on Graphics Processors
Page 2: Parallel Computing on Graphics Processors

Parallel Computing on Graphics Processors

Graphics Processors
- Importance
- Properties and features
- Inside Nvidia GPUs
- How do they operate?

CUDA
- What is CUDA?
- Major concepts and extensions
- How can code be written in CUDA to run on the GPU?
- A sample Bitonic sort code in CUDA

A hybrid sorting algorithm on GPU

Page 3: Parallel Computing on Graphics Processors

Graphics Processors

Page 4: Parallel Computing on Graphics Processors

Introduction

The GPU (Graphics Processing Unit) was originally a co-processor beside the CPU, performing graphics-related jobs whose output goes to display devices.

Increasing market demand for real-time, high-definition 3D graphics has resulted in highly parallel, many-core programmable GPUs.

GPUs have:
- A multi-threaded hardware structure
- Tremendous computational power
- Very high memory bandwidth

Page 5: Parallel Computing on Graphics Processors

Introduction (cont.)

Page 6: Parallel Computing on Graphics Processors

Introduction (cont.)

Why parallel computing?
- Recent GPUs have many simple cores that can operate in parallel.
- They are able to perform different instructions, like a general-purpose processor.
- They operate as a SIMD (Single Instruction, Multiple Data) architecture.
- More precisely, it is not pure SIMD but SIMT (Single Instruction, Multiple Threads).
- The parallel structure of GPUs can be used to perform general-purpose tasks beside the CPU.

Page 7: Parallel Computing on Graphics Processors

Introduction (cont.)

GeForce 8800 Ultra:
- 681 million transistors
- 128 single processors
- 1.5 GHz processor clock
- 576 Gflop/s
- 768 Mbyte DDR3 DRAM
- 1.08 GHz DRAM clock
- 104 Gbyte/s memory bandwidth

Figure: GeForce 8800 Ultra die layout

Page 8: Parallel Computing on Graphics Processors

Introduction (cont.)

GeForce 200 series:
- 1.4 billion transistors
- 583.2 mm² die area (less than 2 cm x 3 cm)
- 192-240 single processors
- 896 Mbyte RAM

Thus, it is a nice parallel platform for scientific parallel computing.

Page 9: Parallel Computing on Graphics Processors

Inside GPU

- Many single processors
- Local storage for processors: private and shared
- Global storage
- Communication between processors and memory: interconnection network
- Interface for communication between GPU, CPU, and main memory
- Graphics-related units

Page 10: Parallel Computing on Graphics Processors
Page 11: Parallel Computing on Graphics Processors

Inside GPU

Main()
{
    ...
    Y = sin(x);
    F = Y^2;
    ...
}

Figure: many cores evaluate the sin instruction for different inputs in parallel (sin(0), sin(1), sin(2), sin(3), sin(4), ...).

Page 12: Parallel Computing on Graphics Processors
Page 13: Parallel Computing on Graphics Processors

Figure: layout of a TPC including 2 SMs (Streaming Multiprocessors)

Page 14: Parallel Computing on Graphics Processors

Inside GPU

Each GPU contains many TPCs (Texture/Processor Clusters), and the number of TPCs per GPU is increasing:
- GeForce 8 series: 8 TPCs
- GeForce 200 series: 10 TPCs

Each TPC contains:
- A Geometry Controller (graphics related)
- An SMC (Streaming Multiprocessor Controller)
- 2 or 3 SMs (Streaming Multiprocessors)
- A Texture Unit (graphics related)

Page 15: Parallel Computing on Graphics Processors

Inside GPU

Each SM contains:
- 8 SPs (Streaming Processors); let us call them "cores". Each core has a MAD (multiply-add) unit.
- An Instruction Cache
- An MT unit (Multithreaded Instruction Fetch and Issue Unit)
- A Constant Cache
- 2 SFUs for transcendental functions (sin, square root, etc.)
- A 16 Kbyte Shared Memory

Figure: layout of an SM (Streaming Multiprocessor)

Page 16: Parallel Computing on Graphics Processors

Inside GPU

- Each core has its own set of registers and register states.
- The Shared Memory, Instruction Cache, and Constant Cache can only be accessed by the cores and other units of that SM, NOT by other SMs.
- Workload is distributed between SMs by the SMC (Streaming Multiprocessor Controller).
- The MT unit of an SM fetches instructions, issues them, and distributes them between the cores.
- Each core fetches the data it needs from shared memory or global memory and executes the instruction.

Page 17: Parallel Computing on Graphics Processors

How does a GPU operate?

SIMD model:
- One instruction is executed by many cores, each on different data.
- They all have to execute Y = sin(x);

Main() { ... Y = sin(x); F = Y^2; ... }

Page 18: Parallel Computing on Graphics Processors

How does a GPU operate?

SIMT (Single Instruction, Multiple Threads) model:
- An instruction can be executed by many threads.
- Each thread is mapped to one core.
- Each thread can be seen as a core, i.e. a virtual simple processor.
- All properties of a core are true for a thread.

Main() { ... Y = sin(x); F = Y^2; ... }

Page 19: Parallel Computing on Graphics Processors

How does a GPU operate?

Remember:
- Each core has its own registers and register states.
- Each core has its own IP (Instruction Pointer) register.
- Therefore, a thread has its own registers, register states, and instruction address.

What does this mean? It means that each thread can:
- Run a different instruction, independent of other threads.
- Hold its own values, resulting from the sequence of instructions it has executed so far.

Page 20: Parallel Computing on Graphics Processors

How does a GPU operate?

What is the result/advantage?
- The GPU is not a SIMD architecture but a SIMT one.
- We have many threads/cores that can operate like many parallel, independent processors.
- Thus, we have a parallel, multi-threaded, shared-memory architecture.

Main() { ... Y = sin(x); F = Y^2; if (F > 10) ... else ... }

Page 21: Parallel Computing on Graphics Processors

How does a GPU operate?

Notes:
- Threads start together at the same instruction address.
- Threads cannot drift very far from each other, because the Instruction Cache has a fixed capacity for fetched instructions.
- On conditional branching instructions, threads that are further ahead have to wait for the other threads. This is called "thread divergence": according to the condition, some threads may want to jump to a distant set of instructions, while the others want to continue with the current IP. Each group of threads is executed serially while the other groups wait.
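As an illustration (not from the slides), a minimal sketch of a kernel that causes thread divergence within a warp; the kernel name divergentKernel is hypothetical:

__global__ void divergentKernel(float* out)
{
    int tid = threadIdx.x;
    // Even and odd threads of the same warp take different branches.
    // The hardware runs the two branches one after the other; each group
    // of threads waits while the other group executes (thread divergence).
    if (tid % 2 == 0)
        out[tid] = sinf((float)tid);
    else
        out[tid] = cosf((float)tid);
}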

Page 22: Parallel Computing on Graphics Processors

How does a GPU operate?

- The key idea is to create a very large number of threads.
- They then start executing the instructions of your code, all beginning at the same address.
- We have many SMs (Streaming Multiprocessors), each containing 8 cores.

Page 23: Parallel Computing on Graphics Processors

How does a GPU operate?

- As a result, we have to group threads and distribute them between SMs and, eventually, cores.
- These groups are called "warps". Each warp contains 32 threads.
- Each time, a warp is assigned to an SM.
- While an SM executes a warp, it pays no attention to other warps.
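As a small assumed illustration (not from the slides), the warp and lane of a thread can be derived from its index within the block using the built-in warpSize variable; the kernel name warpInfo is hypothetical:

__global__ void warpInfo(int* warpIds, int* laneIds)
{
    int tid = threadIdx.x;
    warpIds[tid] = tid / warpSize;   // which warp of the block the thread belongs to
    laneIds[tid] = tid % warpSize;   // the thread's position (0..31) inside its warp
}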

Page 24: Parallel Computing on Graphics Processors

How does a GPU operate?

- The cores of an SM execute the threads of a warp in parallel.
- All SMs operate in parallel.
- All 32 threads of a warp can access the shared memory.
- Because an SM executes one warp at a time, thread divergence only occurs within a warp.

Page 25: Parallel Computing on Graphics Processors

Mapping Code to Processors

How is our code mapped to the threads that will be executed by the cores?

Main() { ... Y = sin(x); F = Y^2; if (F > 0.5) ... else ... }

- The code is divided into parts that should be executed on the CPU and parts that should be executed on the GPU. We are interested in the GPU-related parts.
- All instructions run sequentially. When we reach an instruction from a GPU-related section, it is taken and sent to the GPU.
- We call these taken instructions, sent from the CPU to the GPU, "kernels".

Page 26: Parallel Computing on Graphics Processors

Mapping Code to Processors

Main() { ... Y = sin(x); F = Y^2; if (F > 0.5) ... else ... }

Figure: the CPU sends the kernel to the GPU, where many threads evaluate sin for different inputs in parallel (sin(1), sin(2), sin(3), sin(4), sin(5), ...).

- Each kernel is mapped to a "grid".
- A grid contains a very large number of threads.
- One grid at a time is executed on the GPU.

Page 27: Parallel Computing on Graphics Processors

Mapping Code to Processors

Figure: the kernel Y = sin(x), sent by the CPU, is mapped onto a grid of thread blocks.

- Each grid contains many blocks, and each "thread block" contains many threads.
- Each block contains up to 512 threads. The threads of a block are grouped into warps.
- Each grid can have as many blocks as needed.

Page 28: Parallel Computing on Graphics Processors

Mapping Code to Processors

- The threads inside a block can be organized as a 3D matrix and accessed by three indices (x, y, z).
- The blocks inside a grid are organized as a 2D matrix.
- So each thread and each block is accessible to the programmer, through two pre-defined variables (a sketch combining them follows below):
  1. threadIdx
  2. blockIdx
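As a hedged sketch (the kernel name vecAddMultiBlock and the size parameter n are hypothetical), blockIdx and threadIdx are typically combined to form a global element index when more than one block is launched:

__global__ void vecAddMultiBlock(float* A, float* B, float* C, int n)
{
    // One unique index per thread across the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against threads past the end of the arrays
        C[i] = A[i] + B[i];
}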

Page 29: Parallel Computing on Graphics Processors

Mapping Code to Processors

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

Figure: arrays A and B; each thread (threadIdx 1..6 in the figure) adds one pair of elements, A(i) + B(i).

Page 30: Parallel Computing on Graphics Processors
Page 31: Parallel Computing on Graphics Processors

Mapping Code to Processors

- A grid is executed on the whole GPU and all of its SMs.
- Each thread block is executed on only one SM.
- An SM does not switch to another block until it completes its current block.
- Threads of a block are not aware of other blocks, but they can communicate through global memory.
- Threads inside a block can see each other and communicate through the SM's shared memory.

Page 32: Parallel Computing on Graphics Processors

Another layout of GPU, SMs, cores, and memories

Page 33: Parallel Computing on Graphics Processors

CUDA (Compute Unified Device Architecture)

Page 34: Parallel Computing on Graphics Processors

CUDA

- Nvidia developed a programming environment that mixes CPU and GPU programming.
- It is an extension to C/C++. The extension contains new rules and instructions designated for running on the GPU and for communication between the GPU and the CPU.
- You can write your code, including everything you want executed on either the CPU or the GPU, in CUDA.
- The CUDA compiler (NVCC) parses the code and recognizes kernels and other parts.
- It compiles the kernels to be sent to the GPU; the CPU instructions are simply passed to a C/C++ compiler.

Page 35: Parallel Computing on Graphics Processors

Programming in CUDA

The extensions to the C programming language are four-fold:
- Function type qualifiers, to specify whether a function executes on the host or on the device, and whether it is callable from the host or from the device.
- Variable type qualifiers, to specify the memory location of a variable on the device.
- How to run a kernel: a new directive to specify how a kernel is executed on the device from the host.
- Built-in variables: four built-in variables that specify the grid and block dimensions and the block and thread indices.

Page 36: Parallel Computing on Graphics Processors

Function Type Qualifiers

__device__
The __device__ qualifier declares a function that is:
- Executed on the device
- Callable from the device only

__global__
The __global__ qualifier declares a function as being a kernel:
- Executed on the device
- Callable from the host only

__host__
The __host__ qualifier declares a function that is:
- Executed on the host
- Callable from the host only
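A short sketch showing the three qualifiers together (the function names square, squareAll, and launchSquareAll are hypothetical):

__device__ float square(float x)          // runs on the device, callable only from device code
{
    return x * x;
}

__global__ void squareAll(float* data)    // a kernel: runs on the device, launched from the host
{
    int i = threadIdx.x;
    data[i] = square(data[i]);
}

__host__ void launchSquareAll(float* devData, int n)   // an ordinary host function
{
    squareAll<<<1, n>>>(devData);
}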

Page 37: Parallel Computing on Graphics Processors

Function Type Qualifiers

- __device__ and __global__ functions do not support recursion.
- __global__ functions must have a void return type.

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

Page 38: Parallel Computing on Graphics Processors

Variable Type Qualifiers

__device__
The __device__ qualifier declares a variable that resides on the device. The default is global memory.

__constant__
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
- Resides in constant memory space
- Has the lifetime of an application
- Is accessible from all the threads within the grid

__shared__
The __shared__ qualifier, optionally used together with __device__, declares a variable that:
- Resides in the shared memory space of an SM
- Has the lifetime of the block
- Is accessible only from the threads within the block

Page 39: Parallel Computing on Graphics Processors

Variable Type Qualifiers

If none of these qualifiers is present, the variable:
- Resides in global memory space
- Has the lifetime of an application
- Is accessible from all the threads within the grid

__shared__ int values[];
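A hedged sketch of the variable qualifiers side by side (the names coeff, counter, scale, and tile are hypothetical; the block is assumed to have at most 256 threads):

__constant__ float coeff[16];    // constant memory: application lifetime, readable by all threads of the grid

__device__ int counter;          // plain __device__ variable: resides in global memory

__global__ void scale(float* data)
{
    __shared__ float tile[256];  // shared memory: one copy per block, visible only within that block
    int i = threadIdx.x;
    tile[i] = data[i];
    data[i] = tile[i] * coeff[0];
}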

Page 40: Parallel Computing on Graphics Processors

Execution Configuration

Any call to a __global__ function must specify the execution configuration for that call.

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

Page 41: Parallel Computing on Graphics Processors

Execution Configuration

An expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
- Dg specifies the dimension and size of the grid, i.e. the number of blocks being launched;
- Db specifies the dimension and size of each block, i.e. the number of threads per block;
- Ns specifies the number of bytes of shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory;
- S is of type cudaStream_t and specifies the associated stream.
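A sketch of a launch that uses the full configuration (the kernel process and the variables width, height, and devData are hypothetical):

dim3 dimBlock(16, 16);                         // Db: 16 x 16 = 256 threads per block
dim3 dimGrid(width / 16, height / 16);         // Dg: enough blocks to cover the data
size_t sharedBytes = 256 * sizeof(float);      // Ns: dynamic shared memory per block
process<<<dimGrid, dimBlock, sharedBytes>>>(devData);   // S omitted: the default stream 0 is used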

Page 42: Parallel Computing on Graphics Processors

Built-in Variables

- gridDim
- blockIdx
- blockDim
- threadIdx
- warpSize

Page 43: Parallel Computing on Graphics Processors

Device Memory

- Device memory can be allocated either as linear memory or as CUDA arrays.
- Arrays can be defined like C array definitions, using the variable qualifiers.
- Linear memory exists on the device in a 32-bit address space and is accessible via pointers.
- Both linear memory and CUDA arrays are readable and writable by the host through the memory copy functions.

Page 44: Parallel Computing on Graphics Processors

Memory Management

Linear memory is allocated using cudaMalloc() and freed using cudaFree().

The following code sample allocates an array of 256 floating-point elements in linear memory:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(float));

Page 45: Parallel Computing on Graphics Processors

Memory Management

The following code sample copies a host memory array to device memory:

float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);
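Continuing the slide's example, a sketch of the full round trip, copying the results back and releasing the device memory (the kernel launch in the middle is only indicated by a comment):

float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);                        // allocate device memory
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that operate on devPtr ...
cudaMemcpy(data, devPtr, size, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(devPtr);                                         // free device memory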

Page 46: Parallel Computing on Graphics Processors

Thread Synchronization in CUDA

- In some situations, all threads must reach a certain point together before continuing execution, because the next part needs the results computed up to that point.
- Without synchronization, some threads may run ahead and access incomplete results produced by slower threads that have not reached that point yet.

__syncthreads();
Synchronizes all threads in a block. Once all threads have reached this point, execution resumes normally.

cudaThreadSynchronize()
Called from the host; it blocks until the device has completed all preceding work, effectively synchronizing all threads of the grid.
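A minimal sketch of a kernel that needs __syncthreads() (the kernel name reverse is hypothetical; the block is assumed to have at most 256 threads):

__global__ void reverse(int* data)
{
    __shared__ int buffer[256];
    int tid = threadIdx.x;
    buffer[tid] = data[tid];
    __syncthreads();       // wait until every thread has written its element to shared memory
    data[tid] = buffer[blockDim.x - 1 - tid];   // safe: the element read here was written by another thread
}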

Page 47: Parallel Computing on Graphics Processors

Bitonic Sort on GPU using CUDA

int main(int argc, char** argv)
{
    int values[NUM];
    for (int i = 0; i < NUM; i++)
        values[i] = rand();

    int* dvalues;
    CUDA_SAFE_CALL(cudaMalloc((void**)&dvalues, sizeof(int) * NUM));
    CUDA_SAFE_CALL(cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice));

    bitonicSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);

    CUDA_SAFE_CALL(cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost));
    CUDA_SAFE_CALL(cudaFree(dvalues));
    CUT_EXIT(argc, argv);
}

Page 48: Parallel Computing on Graphics Processors

Bitonic Sort on GPU using CUDA

#define NUM 256

__global__ static void bitonicSort(int* values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;

    // Copy the input into shared memory.
    shared[tid] = values[tid];
    __syncthreads();

    // Parallel bitonic sort (swap is a device helper that exchanges
    // two shared-memory elements; it is not shown on the slide).
    for (unsigned int k = 2; k <= NUM; k *= 2)
    {
        for (unsigned int j = k / 2; j > 0; j /= 2)
        {
            unsigned int ixj = tid ^ j;
            if (ixj > tid)
            {
                if ((tid & k) == 0)
                {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
                else
                {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }

    // Write the result back to global memory.
    values[tid] = shared[tid];
}
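A possible host-side check (not shown on the slide) that the array copied back into values is in non-decreasing order; it assumes <stdio.h> is included:

for (int i = 1; i < NUM; i++)
    if (values[i - 1] > values[i])
        printf("Sort failed at index %d\n", i);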

Page 49: Parallel Computing on Graphics Processors

A hybrid sorting algorithm on GPU

Erik Sintorn, Ulf Assarsson, "Fast Parallel GPU-Sorting Using a Hybrid Algorithm", Journal of Parallel and Distributed Computing, Vol. 68, Issue 10, October 2008, pp. 1381-1388.

Page 50: Parallel Computing on Graphics Processors

Fast Parallel GPU-Sorting Using a Hybrid Algorithm

The algorithm is a combination of two well-known sorting algorithms:
- Merge Sort
- Bucket Sort

Two levels of sorting:
- External sort: using Bucket Sort
- Internal sort: using Merge Sort

Page 51: Parallel Computing on Graphics Processors

Fast Parallel GPU-Sorting Using a Hybrid Algorithm

Two main steps:
1. Divide the list of items into L sublists of equal size, such that for 1 <= i < L the items of sublist i+1 are larger than the items of sublist i. This is done using bucket sort, as defined in T.H. Cormen et al., "Introduction to Algorithms", Second Edition, MIT Press and McGraw-Hill, 2001, Section 9.4: Bucket sort, pp. 174-177.
2. Internally sort each sublist using merge sort. This is done with a vector-based merge sort, in which vectors of length 4 are compared instead of individual items.

Page 52: Parallel Computing on Graphics Processors

Fast Parallel GPU-Sorting Using a Hybrid Algorithm

Dividing the list of items into L sublists of equal size:
- First, find the maximum and minimum elements of the list.
- Then execute the following pseudo-code:

{
    element = input(threadId);
    index(threadId) = (element - min) / (max - min) * L;
}

Example: list 23 7 12 18 9 2 1 15 10 8, with Min = 1, Max = 23, L = 4.
Bucket ranges: 1 <= x < 6.5, 6.5 <= x < 12, 12 <= x < 17.5, 17.5 <= x <= 23.
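A CUDA sketch of this step (the kernel name bucketIndex and its parameters are hypothetical; the paper's actual implementation also counts bucket sizes, e.g. with atomic operations, which is omitted here):

__global__ void bucketIndex(const float* input, int* bucketOfItem,
                            float minVal, float maxVal, int L)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float element = input[tid];
    // Map the element linearly onto one of the L buckets.
    int idx = (int)((element - minVal) / (maxVal - minVal) * L);
    if (idx == L)
        idx = L - 1;          // the maximum element falls into the last bucket
    bucketOfItem[tid] = idx;
}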

Page 53: Parallel Computing on Graphics Processors

Fast Parallel GPU-Sorting Using a Hybrid Algorithm

- A refinement pass is run over all sublists to adjust their upper and lower bounds; finally, we have L sublists of equal size.
- To have all items sorted, it is then enough to sort each sublist internally.
- Each sublist is given to a thread block so that it is sorted by one SM.
- This is done using a vector-based merge sort.

Page 54: Parallel Computing on Graphics Processors

Fast Parallel GPU-Sorting Using a Hybrid Algorithm

- The GeForce 8800 provides vector-based operations. The length of a vector on the GeForce 8800 is 4; these vectors are called 4-float vectors.
- 4-float vector comparison and 4-float vector sort are used as building blocks.
- The items of each sublist are grouped into vectors of length 4.
- The 4-float vectors of each sublist are sorted with merge sort by an SM.
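As an assumed illustration of what a 4-float vector operation looks like in CUDA (the helper vecMin is hypothetical; the paper's vector merge is more involved):

__device__ float4 vecMin(float4 a, float4 b)
{
    // Component-wise minimum of two float4 values; vector operations of this
    // kind replace scalar comparisons in the vector-based merge sort.
    return make_float4(fminf(a.x, b.x), fminf(a.y, b.y),
                       fminf(a.z, b.z), fminf(a.w, b.w));
}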

Page 55: Parallel Computing on Graphics Processors

Vector-Based Merge Sort

Figure: merging within the i-th sublist

Page 56: Parallel Computing on Graphics Processors

Comparison to other GPU-based Sorting Algorithms

Page 57: Parallel Computing on Graphics Processors

General Purpose Programming on GPU