cjharris gpu computing opencl
Post on 03-Apr-2018
7/28/2019 Cjharris Gpu Computing Opencl
http://slidepdf.com/reader/full/cjharris-gpu-computing-opencl 1/61
Getting Started with OpenCL GPU Computing
iVEC Workshop
30th May - 1st June 2012
Open Compute Language (OpenCL)
OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices.
OpenCL is being created by the Khronos Group.
Participating companies and institutions:
3DLABS, Activision Blizzard, AMD, Apple, ARM, Broadcom, Codeplay, Electronic Arts, Ericsson, Freescale, Fujitsu, GE, Graphic Remedy, HI, IBM, Intel, Imagination Technologies, Los Alamos National Laboratory, Motorola, Movidius, Nokia, NVIDIA, Petapath, QNX, Qualcomm, RapidMind, Samsung, Seaweed, S3, ST Microelectronics, Takumi, Texas Instruments, Toshiba and Vivante.
http://www.khronos.org/opencl/
How is OpenCL different from CUDA?

CUDA:
- core GPU computing on NVIDIA hardware
- optimised libraries for NVIDIA hardware
- better marketing
- slightly simpler API
- more readily available documentation

OpenCL:
- AMD implementation on AMD CPU/GPU and Intel CPU
- Intel implementation on Intel CPU
- IBM implementation on Intel/AMD/NVIDIA/Power
- Intel implementation on Intel MIPS
- portable, but not necessarily optimised code
OpenCL Platforms
Platform:
A host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform.
Diagram: a Platform comprising a Host and several Devices.
OpenCL Platforms : clGetPlatformIDs
The following routine is used to query the number of OpenCL platforms, and their corresponding IDs:

cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id* platforms,
                        cl_uint* num_platforms)

Arguments
num_entries : capacity of the memory pointed to by platforms
platforms : pointer to memory to store the returned platform IDs
num_platforms : returns the actual number of platform IDs available

Returns
Either CL_SUCCESS or an error code.
OpenCL Platforms : clGetPlatformIDs
Code Example:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// checkErr is defined on the next slide
void checkErr(cl_int clErr, const char* filename, int line);

int main(int argc, char** argv)
{
    // determine number of platforms
    cl_int clErr;
    cl_uint num_platforms;
    clErr = clGetPlatformIDs(0, NULL, &num_platforms);
    checkErr(clErr, __FILE__, __LINE__);
    printf("OpenCL Platforms found: %u\n", num_platforms);
    if (num_platforms < 1) { exit(0); }

    // get platform IDs
    cl_platform_id platforms[num_platforms];
    clErr = clGetPlatformIDs(num_platforms, platforms, NULL);
    checkErr(clErr, __FILE__, __LINE__);

    return 0;
}
OpenCL Errors
void checkErr(cl_int clErr, const char* filename, int line)
{
    if (clErr != CL_SUCCESS)
    {
        printf("OpenCL Error %i at line %i of %s\n", clErr, line, filename);
        exit(EXIT_FAILURE);
    }
}
You can find the error codes in cl.h :
/* Error Codes */
#define CL_SUCCESS                          0
#define CL_DEVICE_NOT_FOUND                 -1
#define CL_DEVICE_NOT_AVAILABLE             -2
#define CL_COMPILER_NOT_AVAILABLE           -3
#define CL_MEM_OBJECT_ALLOCATION_FAILURE    -4
#define CL_OUT_OF_RESOURCES                 -5
#define CL_OUT_OF_HOST_MEMORY               -6
#define CL_PROFILING_INFO_NOT_AVAILABLE     -7
#define CL_MEM_COPY_OVERLAP                 -8
#define CL_IMAGE_FORMAT_MISMATCH            -9
#define CL_IMAGE_FORMAT_NOT_SUPPORTED       -10
#define CL_BUILD_PROGRAM_FAILURE            -11
#define CL_MAP_FAILURE                      -12
...
Where is OpenCL on Fornax?
NVIDIA Implementation (for NVIDIA GPU):
module load cuda
/opt/centos6.1-modules/cuda/4.1.28/cuda/include/CL/cl.h
/opt/nodes.updates/login.cuda.lib/lib64/libOpenCL.so
/opt/nodes.updates/login.cuda.lib/lib/libOpenCL.so

AMD Implementation (for Intel CPU):
module load AMDAPP
/opt/centos6.1-modules/AMDAPP/2.5/include/CL/cl.h
/opt/centos6.1-modules/AMDAPP/2.5/lib/x86_64/libOpenCL.so
/opt/centos6.1-modules/AMDAPP/2.5/lib/x86/libOpenCL.so

Intel Implementation (for Intel CPU):
not installed - hasn't been requested
Compiling OpenCL on Fornax (NVIDIA)
Load required modules if necessary:
module load gcc
module load cuda
Command line compile:
gcc platform_id.c -o platform_id -lOpenCL
Better to use a Makefile:
default:
	gcc platform_id.c -o platform_id -lOpenCL
And compile with make:
make
Running OpenCL on Fornax (NVIDIA)
Change to scratch directory:
cd /scratch/projectname/username/programpath
One node PBS script subPlatformID:
#!/bin/bash
#PBS -W group_list=projectname
#PBS -q workq
#PBS -l walltime=00:10:00
#PBS -l select=1:ncpus=1:ngpus=1:mem=64gb
#PBS -l place=excl

module load cuda

cd /scratch/projectname/username/programpath
/home/username/programpath/platform_id
Submit with qsub:
qsub subPlatformID
Check queue, directory for output:
qstat
ls
cat subPlatformID.oXXXX subPlatformID.eXXXX
OpenCL Platforms : clGetPlatformInfo
The following OpenCL routine is used to query platforms:

cl_int clGetPlatformInfo(cl_platform_id platform,
                         cl_platform_info param_name,
                         size_t param_value_size,
                         void* param_value,
                         size_t* param_value_size_ret)

Arguments
platform : the platform being queried
param_name : CL_PLATFORM_PROFILE, CL_PLATFORM_VERSION, etc
param_value_size : size of memory pointed to by param_value
param_value : pointer to memory to store the return value
param_value_size_ret : returns the size in bytes of the data being queried

Returns
Either CL_SUCCESS or an error code.
OpenCL Platforms : clGetPlatformInfo
Code Example:

...

// get platform info
int i;
for (i = 0; i < num_platforms; i++)
{
    size_t size;
    clErr = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, 0, NULL, &size);
    checkErr(clErr, __FILE__, __LINE__);
    char vendor[size];
    clErr = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, size, vendor, NULL);
    checkErr(clErr, __FILE__, __LINE__);
    printf("Platform %i: %s\n", i, vendor);
}

...
OpenCL Programming Task : Platform Query
Write a program that prints out:
- the number of OpenCL platforms
- the names of the OpenCL platforms
You can find a template in:
/scratch/courses01/templates/opencl_platform.c
You may find the following function definitions useful:
cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id* platforms,
                        cl_uint* num_platforms)

cl_int clGetPlatformInfo(cl_platform_id platform,
                         cl_platform_info param_name,
                         size_t param_value_size,
                         void* param_value,
                         size_t* param_value_size_ret)

param_name : CL_PLATFORM_PROFILE, CL_PLATFORM_VERSION, etc
OpenCL Devices
Device:
An OpenCL device consists of a global memory and a number of compute units, each in turn containing a number of processing elements and a local memory.
Diagram: a Device containing Global Memory and several Compute Units, each with processing elements (PEs) and local memory.
OpenCL Devices : clGetDeviceIDs
The following OpenCL routine is used to obtain the number of devices, and their IDs, available in a platform:

cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id* devices,
                      cl_uint* num_devices)

Arguments
platform : platform ID of the desired platform
device_type : CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, etc
num_entries : size of the pointer allocation
devices : pointer to return device IDs
num_devices : pointer to return the number of devices

Returns
Either CL_SUCCESS or an error code.
OpenCL Devices : clGetDeviceIDs
Code Example:

...

// get number of devices
cl_uint num_devices;
clErr = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
checkErr(clErr, __FILE__, __LINE__);
printf("\nOpenCL GPU Devices found: %u\n", num_devices);

// get device IDs
cl_device_id devices[num_devices];
clErr = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Devices : clGetDeviceInfo
The following OpenCL routine is used to query devices:

cl_int clGetDeviceInfo(cl_device_id device,
                       cl_device_info param_name,
                       size_t param_value_size,
                       void* param_value,
                       size_t* param_value_size_ret)

Arguments
device : the device to query
param_name : CL_DEVICE_NAME, and many more
param_value_size : size of memory pointed to by param_value
param_value : pointer to memory to store the return value
param_value_size_ret : returns the size in bytes of the data being queried

Returns
Either CL_SUCCESS or an error code.
OpenCL Devices : clGetDeviceInfo
There is a long list of device properties; they are listed in the OpenCL specification document:

CL_DEVICE_TYPE
CL_DEVICE_VENDOR_ID
CL_DEVICE_MAX_COMPUTE_UNITS
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT
CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF
...
OpenCL Devices : clGetDeviceInfo
Code Example:

...

// get device info
for (i = 0; i < num_devices; i++)
{
    size_t size;
    clErr = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, 0, NULL, &size);
    checkErr(clErr, __FILE__, __LINE__);
    char name[size];
    clErr = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, size, name, NULL);
    checkErr(clErr, __FILE__, __LINE__);
    printf("\tDevice %i: %s\n", i, name);
}

...
OpenCL Programming Task : Device Query
Write a program that prints out:
- the names of the devices in the platform
You can find a template in:
/scratch/courses01/templates/opencl_device.c
You may find the following function definitions useful:
cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id* devices,
                      cl_uint* num_devices)

device_type : CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, etc

cl_int clGetDeviceInfo(cl_device_id device,
                       cl_device_info param_name,
                       size_t param_value_size,
                       void* param_value,
                       size_t* param_value_size_ret)

param_name : CL_DEVICE_NAME, etc
OpenCL Context
Context:
An OpenCL context is a collection of OpenCL concepts that are associated with a group of devices, including Command Queues, Device Buffers, Programs and Kernels.
Diagram: a Context containing Devices, Device Buffers, a Program with Kernels, and a Command Queue.
OpenCL Context : clCreateContext
The following OpenCL routine is used to create contexts:

cl_context clCreateContext(cl_context_properties* properties,
                           cl_uint num_devices,
                           cl_device_id* devices,
                           void (CL_CALLBACK* pfn_notify)(const char* errinfo,
                                                          const void* private_info,
                                                          size_t cb,
                                                          void* user_data),
                           void* user_data,
                           cl_int* errcode_ret)

Arguments
properties : the desired properties of the context (more on the next slide)
num_devices : the number of devices in the context
devices : pointer to a list of IDs of the desired devices
pfn_notify : pointer to a callback function
user_data : pointer to user-defined data to be passed to the callback
errcode_ret : pointer to a value to return an error code

Returns
The requested OpenCL context, assuming no errors were returned.
OpenCL Context Properties
The cl_context_properties type is a zero-terminated list of context properties and their desired values. As a minimum, the corresponding platform should be provided:

...

// define desired context properties list
cl_context_properties properties[] = {CL_CONTEXT_PLATFORM,
                                      (cl_context_properties)platform,
                                      0};

// create context
cl_context context = clCreateContext(properties, 1, &device, NULL, NULL, &clErr);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Context : clReleaseContext
When a context is no longer required, it should be released:
cl_int clReleaseContext(cl_context context)

Arguments
context : the context to release

Returns
Either CL_SUCCESS or an error code

Code Example:

...

// release context
clErr = clReleaseContext(context);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Programming Task : Context
Write a program that:
- creates an OpenCL context
You can find a template in:
/scratch/courses01/templates/opencl_context.c
You may find the following function definitions useful:
cl_context clCreateContext(cl_context_properties* properties,
                           cl_uint num_devices,
                           cl_device_id* devices,
                           void (CL_CALLBACK* pfn_notify)(const char* errinfo,
                                                          const void* private_info,
                                                          size_t cb,
                                                          void* user_data),
                           void* user_data,
                           cl_int* errcode_ret)

cl_context_properties properties[] = {CL_CONTEXT_PLATFORM,
                                      (cl_context_properties)platform,
                                      0};
OpenCL Command Queue
Command Queue:
An OpenCL command queue provides a mechanism to queue commands that operate on the various objects of a context.
The command queue can either act as a simple First In First Out(FIFO) queue, or use events to create command dependencies.
OpenCL Command Queue : clCreateCommandQueue
The following OpenCL routine is used to create queues:

cl_command_queue clCreateCommandQueue(cl_context context,
                                      cl_device_id device,
                                      cl_command_queue_properties properties,
                                      cl_int* errcode_ret)

Arguments
context : the context for the command queue
device : the device that is the target of the commands
properties : CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, etc
errcode_ret : pointer to a value to return an error code

Returns
The requested OpenCL command queue
OpenCL Command Queue : clReleaseCommandQueue
The following OpenCL routine is used to release queues:

cl_int clReleaseCommandQueue(cl_command_queue command_queue)

Arguments
command_queue : the queue to release

Returns
Either CL_SUCCESS or an error code
OpenCL Command Queue
Code Example:

...

// create command queue
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &clErr);
checkErr(clErr, __FILE__, __LINE__);

...

// release command queue
clErr = clReleaseCommandQueue(queue);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Programming Task : Command Queue
Write a program that:
- creates and releases a command queue
You can find a template in:
/scratch/courses01/templates/opencl_queue.c
You may find the following function definitions useful:

cl_command_queue clCreateCommandQueue(cl_context context,
                                      cl_device_id device,
                                      cl_command_queue_properties properties,
                                      cl_int* errcode_ret)

cl_int clReleaseCommandQueue(cl_command_queue command_queue)
OpenCL Buffers
Buffer:
An OpenCL buffer is a memory object that resides in device global memory. There are also many other types of memory objects that support various data structures.
Buffers are attached to contexts and are associated with devices.
Diagram: Device Buffers residing in a Device's Global Memory, alongside its Compute Units and their processing elements.
OpenCL Buffers : clCreateBuffer
The following OpenCL routine is used to create buffers:

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void* host_ptr,
                      cl_int* errcode_ret)

Arguments
context : the context for the buffer
flags : CL_MEM_READ_WRITE, CL_MEM_READ_ONLY, etc
size : size of the buffer in bytes
host_ptr : pointer to host memory to populate the buffer (optional)
errcode_ret : pointer to a value to return an error code

Returns
The requested OpenCL buffer, as a cl_mem object.
OpenCL Buffers : clReleaseMemObject
The following OpenCL routine is used to release buffers:

cl_int clReleaseMemObject(cl_mem memobject)

Arguments
memobject : the memory object to release

Returns
Either CL_SUCCESS or an error code
OpenCL Buffers : clEnqueueWriteBuffer
The following OpenCL routine is used to write data from host memory into a buffer:

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t size,
                            const void* ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event* event_wait_list,
                            cl_event* event)

Arguments
command_queue : the queue to enqueue the write to
buffer : the buffer to write to
blocking_write : whether this function blocks until the transfer is complete (CL_TRUE/CL_FALSE)
offset : how far into the buffer to begin writing
size : the size of the transfer in bytes
ptr : the location in host memory of the data
num_events_in_wait_list : number of events the write is dependent on
event_wait_list : list of events the write is dependent on
event : returns an event corresponding to this write

Returns
Either CL_SUCCESS or an error code
OpenCL Buffers : clEnqueueReadBuffer
The following OpenCL routine is used to read data from a buffer into host memory:

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t size,
                           void* ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event* event_wait_list,
                           cl_event* event)

Arguments
command_queue : the queue to enqueue the read to
buffer : the buffer to read from
blocking_read : whether this function blocks until the transfer is complete (CL_TRUE/CL_FALSE)
offset : how far into the buffer to begin reading
size : the size of the transfer in bytes
ptr : the location in host memory to put the data
num_events_in_wait_list : number of events the read is dependent on
event_wait_list : list of events the read is dependent on
event : returns an event corresponding to this read

Returns
Either CL_SUCCESS or an error code
OpenCL Buffers
Code Example:

...

// create device buffer
cl_mem device_values = clCreateBuffer(context, CL_MEM_READ_WRITE, bsize,
                                      NULL, &clErr);
checkErr(clErr, __FILE__, __LINE__);

// write values to device buffer
clErr = clEnqueueWriteBuffer(queue, device_values, CL_TRUE, 0, bsize,
                             (void*)host_values, 0, NULL, NULL);
checkErr(clErr, __FILE__, __LINE__);

// read values from device buffer
clErr = clEnqueueReadBuffer(queue, device_values, CL_TRUE, 0, bsize,
                            (void*)host_values, 0, NULL, NULL);
checkErr(clErr, __FILE__, __LINE__);

// release device buffer
clErr = clReleaseMemObject(device_values);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Programming Task : Buffers
Write a program that:
- creates two arrays on the host, and populates one
- writes the populated array to a device buffer
- reads the device buffer back into the other array on the host

You can find a template in:

/scratch/courses01/templates/opencl_buffers.c

You may find the following function definitions useful:

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void* host_ptr,
                      cl_int* errcode_ret)

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t size,
                            const void* ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event* event_wait_list,
                            cl_event* event)
OpenCL Programs
Program:
An OpenCL program is a set of kernel sources, written as functions defined with the __kernel qualifier, and binaries compiled for specific device architectures.
OpenCL Programs : clCreateProgramWithSource
The following OpenCL routine is used to create programs:

cl_program clCreateProgramWithSource(cl_context context,
                                     cl_uint count,
                                     const char** strings,
                                     const size_t* lengths,
                                     cl_int* errcode_ret)

Arguments
context : the context for the program
count : the number of strings containing the source
strings : pointer to an array of pointers to the strings
lengths : pointer to an array of the string lengths (NULL if \0 terminated)
errcode_ret : pointer to a value to return an error code

Returns
The requested OpenCL program
OpenCL Programs : clCreateProgramWithSource
Options for program kernel source code:
1) Include the kernels as strings in the host source file
- have to code within quotes
- need to recompile to change kernel source
- guaranteed to have kernel source

2) Read in the kernels from files at runtime
- can code normally
- can change kernels without recompiling
- need to ensure path to files is correct
OpenCL Programs : clCreateProgramWithSource
Kernel Source String Example:
const char* source =
"__kernel void zeroValues(__global int* values, int imax)\n"
"{\n"
"    // thread index and total\n"
"    int idx = get_global_id(0);\n"
"    int idtotal = get_global_size(0);\n"
"\n"
"    // zero values\n"
"    int i;\n"
"    for(i=idx;i<imax;i+=idtotal)\n"
"    {\n"
"        values[i] = 0;\n"
"    }\n"
"}\n\0";
OpenCL Programs : clCreateProgramWithSource
Kernel Source File Example:
__kernel void zeroValues(__global int* values, int imax)
{
    // thread index and total
    int idx = get_global_id(0);
    int idtotal = get_global_size(0);

    // zero values
    int i;
    for (i = idx; i < imax; i += idtotal)
    {
        values[i] = 0;
    }
}
OpenCL Programs : clBuildProgram
Use clBuildProgram to compile and link the kernel source:

cl_int clBuildProgram(cl_program program,
                      cl_uint num_devices,
                      const cl_device_id* device_list,
                      const char* options,
                      void (CL_CALLBACK* pfn_notify)(cl_program program,
                                                     void* user_data),
                      void* user_data)

Arguments
program : the program to build
num_devices : the number of devices to target
device_list : list of devices to target
options : compiler flags
pfn_notify : pointer to a callback for when the build is done (blocking call if NULL)
user_data : data to provide in the callback

Returns
CL_SUCCESS or an error code
OpenCL Programs : clGetProgramBuildInfo
Use clGetProgramBuildInfo to get the compiler log:

cl_int clGetProgramBuildInfo(cl_program program,
                             cl_device_id device,
                             cl_program_build_info param_name,
                             size_t param_value_size,
                             void* param_value,
                             size_t* param_value_size_ret)

Arguments
program : the program that was built
device : the device that the kernels were compiled for
param_name : CL_PROGRAM_BUILD_LOG, etc
param_value_size : size of memory pointed to by param_value
param_value : pointer to memory to store the return value
param_value_size_ret : returns the size in bytes of the data being queried

Returns
CL_SUCCESS or an error code
OpenCL Programs : clCreateKernel
Use clCreateKernel to define the kernel entry points:

cl_kernel clCreateKernel(cl_program program,
                         const char* kernel_name,
                         cl_int* errcode_ret)

Arguments
program : the program that was built
kernel_name : the name of the kernel function
errcode_ret : pointer to a value to return an error code

Returns
The OpenCL kernel corresponding to the kernel name
OpenCL Programs : clReleaseProgram, clReleaseKernel
Use clReleaseKernel to release the kernel:

cl_int clReleaseKernel(cl_kernel kernel)

Arguments
kernel : the kernel to release

Use clReleaseProgram to release the program:

cl_int clReleaseProgram(cl_program program)

Arguments
program : the program to release

Returns
Either CL_SUCCESS or an error code
OpenCL Programs and Kernels
Code Example:

// create program from source
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &clErr);
checkErr(clErr, __FILE__, __LINE__);

// compile program
clErr = clBuildProgram(program, 1, &device, "", NULL, NULL);
checkErr(clErr, __FILE__, __LINE__);

// print build log
size_t size;
clErr = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &size);
checkErr(clErr, __FILE__, __LINE__);
char build_log[size];
clErr = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, size, build_log, NULL);
checkErr(clErr, __FILE__, __LINE__);
printf("\nBuild Log:\n\n%s\n\n", build_log);

// create kernel
cl_kernel kernel = clCreateKernel(program, "invertValues", &clErr);
checkErr(clErr, __FILE__, __LINE__);

// release kernel
clErr = clReleaseKernel(kernel);
checkErr(clErr, __FILE__, __LINE__);

// release program
clErr = clReleaseProgram(program);
checkErr(clErr, __FILE__, __LINE__);
OpenCL Programming Task : Programs and Kernels
Write and build a kernel that would:
- invert an array of integers valued 0-255

You can find a template in:

/scratch/courses01/templates/opencl_program.c

You may find the following function definitions useful:

cl_program clCreateProgramWithSource(cl_context context,
                                     cl_uint count,
                                     const char** strings,
                                     const size_t* lengths,
                                     cl_int* errcode_ret)

cl_int clBuildProgram(cl_program program,
                      cl_uint num_devices,
                      const cl_device_id* device_list,
                      const char* options,
                      void (CL_CALLBACK* pfn_notify)(cl_program program,
                                                     void* user_data),
                      void* user_data)

cl_kernel clCreateKernel(cl_program program,
                         const char* kernel_name,
                         cl_int* errcode_ret)
OpenCL Kernel Execution
To execute the kernel on the device, we must:
1) Set the Kernel Arguments
2) Determine the Thread Topology (NDRange)
3) Enqueue the Kernel Execution
OpenCL Kernels : clSetKernelArg
Use clSetKernelArg to specify the kernel arguments:

cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void* arg_value)

Arguments
kernel : the kernel the argument belongs to
arg_index : the index of the argument
arg_size : the size of the argument
arg_value : a pointer to the value of the argument

Returns
CL_SUCCESS or an error code
OpenCL Setting Kernel Arguments
Code Example:

...

int imax = 1024;

...

// create device buffer
cl_mem device_values = clCreateBuffer ...
checkErr(clErr, __FILE__, __LINE__);

...

// set kernel arguments
clErr = clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_values);
checkErr(clErr, __FILE__, __LINE__);
clErr = clSetKernelArg(kernel, 1, sizeof(int), &imax);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Thread Topology
OpenCL uses a scalable programming model: an NDRange of multiple workgroups that contain the workitems that will execute on the device.
Diagram: an NDRange containing Workgroups 0-2, each holding workitems W0-W3.
Each workitem has access to functions that return the dimensions of the NDRange and Workgroup, as well as its index within them.
uint get_work_dim()
size_t get_global_size(uint d)
size_t get_global_id(uint d)
size_t get_local_size(uint d)
size_t get_local_id(uint d)
OpenCL Thread Topology
The NDRange is divided into workgroups so that they canbe dynamically allocated to the compute units.
Diagram: eight workgroups (WG 0-7) of a multithreaded OpenCL program being scheduled onto devices: a 2x compute unit device runs them two at a time across CU 0 and CU 1, while a 4x compute unit device runs them four at a time.
OpenCL Thread Topology Implications
The workgroup size must consider the multiprocessor architecture, with some consideration for future changes.
Diagram: two workgroups (WG 0 and WG 1) running on a single compute unit (CU 0).

Just consider a few workgroups running on a single compute unit. What does the workgroup size affect?
OpenCL Thread Topology Implications
The major consideration in choosing the NDRange size is the number of compute units, with some consideration for future changes.
Just consider all workgroups running on all the compute units. What does the number of workgroups in the NDRange affect?
Diagram: eight workgroups (WG 0-7) spread across the compute units of a 4x compute unit device.
OpenCL Kernels : clEnqueueNDRangeKernel
Use clEnqueueNDRangeKernel to queue the kernel:

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t* global_work_offset,
                              const size_t* global_work_size,
                              const size_t* local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event* event_wait_list,
                              cl_event* event)

Arguments
command_queue : the queue to submit the kernel to
kernel : the kernel to submit
work_dim : the dimensions of the thread topology
global_work_offset : a pointer to an array of offsets to the global indices
global_work_size : a pointer to an array of sizes of the global NDRange
local_work_size : a pointer to an array of sizes of the local workgroup
num_events_in_wait_list : number of events the kernel execution is dependent on
event_wait_list : list of events the kernel execution is dependent on
event : returns an event corresponding to this kernel execution

Returns
CL_SUCCESS or an error code
OpenCL Kernel Execution
Code Example:

...

// enqueue kernel
cl_uint dim = 1;
size_t offset = 0;
size_t local_size = 128;
size_t global_size = 4*14*local_size;
clErr = clEnqueueNDRangeKernel(queue, kernel, dim, &offset, &global_size, &local_size,
                               0, NULL, NULL);
checkErr(clErr, __FILE__, __LINE__);

...
OpenCL Programming Task : Invert Kernel
Write a program that:
- generates an array of at least a thousand values, between 0 and 255
- prints the first few values of the array
- inverts the array on the GPU (subtract values from 256)
- prints the first few new values of the array

You can find template files at:

/scratch/courses01/templates/opencl_inverse.c

You may find the following definitions useful:

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t* global_work_offset,
                              const size_t* global_work_size,
                              const size_t* local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event* event_wait_list,
                              cl_event* event)
OpenCL Programming Task : Sum Kernel
Write a program that:
- generates an array of at least a million values, between 0 and 255
- sums the array using a loop on the CPU
- sums the array using the GPU
- prints the two results

Copy your invert code as a starting point.

Hints:
- each workitem can add some numbers together
- you can synchronize workitems by stopping the kernel
- you may need more than one device buffer allocation
- if your array is large enough, you may need to consider numerical precision.
Further OpenCL Concepts
C-extensions in the kernel language for vectors
Local memory for workitem communication within workgroups
Workgroup and device-level synchronisation
Coalescing global memory access
Branching issues
Memory stalls and arithmetic intensity
Overlapping kernels with host-device transfers
Pinned memory host-device transfers
Managing compute locality in algorithms
Graphical data-types and hardware acceleration
Graphics API interoperability
Mode switching
Using OpenCL events