- Page 1

OpenCL: An Introduction for HPC Programmers

Benedict Gaster, AMD

Tim Mattson, Intel

- Page 2

Preliminaries:

• Disclosures
- The views expressed in this tutorial are those of the people delivering the tutorial.
- We are not speaking for our employers.
- We are not speaking for Khronos.

• We take our tutorials VERY seriously: - Help us improve … give us feedback and tell us how you would make this tutorial even better.

- Page 3

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples
- Vector addition: the basic platform layer
- Matrix multiplication: writing simple kernels
- Optimizing matrix multiplication: work groups and the memory model
- Radix Sort

• A survey of OpenCL 1.1

- Page 4

It’s a Heterogeneous world

[Figure: a modern platform block diagram — two CPUs, a GMCH with an integrated GPU, an ICH, and DRAM. GMCH = graphics memory control hub; ICH = input/output control hub.]

• A modern platform includes:

–One or more CPUs

–One or more GPUs

–DSP processors

–… other?

OpenCL lets programmers write a single portable program that uses ALL the resources in the heterogeneous platform.

- Page 5

Microprocessor trends

[Figure: example processors — IBM Cell (1 CPU + 6 cores), NVIDIA Tesla C1060 (240 cores), ATI RV770 (160 cores), Intel SCC (48 cores).]

3rd party names are the property of their owners.

Individual processors are many-core (and often heterogeneous) processors.

The heterogeneous many-core challenge: how are we to build a software ecosystem for the heterogeneous many-core platform?

- Page 6

Industry Standards for Programming Heterogeneous Platforms

• CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP

• GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages

• The emerging intersection: heterogeneous computing

OpenCL – Open Computing Language

Open, royalty-free standard for portable, parallel programming of heterogeneous CPUs, GPUs, and other processors

- Page 7

The BIG idea behind OpenCL

• Replace loops with a function (a kernel) executing at each point in a problem domain.

- E.g., process a 1024 x 1024 image with one kernel invocation per pixel or 1024 x 1024 = 1,048,576 kernel executions

Traditional loops:

void trad_mul(int n,
              const float *a,
              const float *b,
              float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

Data Parallel OpenCL:

kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}   // execute over "n" work-items

- Page 8

The origins of OpenCL

• AMD and ATI: merged, needed commonality across products.

• Nvidia: GPU vendor, wants to steal market share from the CPU.

• Intel: CPU vendor, wants to steal market share from the GPU.

• Apple: was tired of recoding for many-core chips and GPUs; wrote a rough-draft strawman API and pushed the vendors to standardize.

• Khronos Compute group formed: Ericsson, Sony, EA, Nokia, Freescale, TI, IBM + many more. Dec 2008.

Third party names are the property of their owners.

- Page 9

OpenCL Working Group within Khronos

• Diverse industry participation … processor vendors, system OEMs, middleware vendors, application developers.

• OpenCL became an important standard "on release" by virtue of the market coverage of the companies behind it.

- Page 10

OpenCL Timeline

Jun 2008: OpenCL working group formed within Khronos.

Dec 2008: Khronos publicly releases the OpenCL 1.0 specification.

During 2H09: multiple conformant implementations ship across a diverse range of platforms.

Jun 2010: Khronos publicly releases the OpenCL 1.1 specification; conformant implementations available shortly thereafter.

• 6 months to get from "strawman" to final draft.

• Rapid innovation required to match the pace of HW innovation.

• 18 months from 1.0 to 1.1 … backwards compatibility to protect SW investments.

• OpenCL 1.2 expected around Dec '12.

- Page 11

OpenCL: From cell phone to supercomputer

• OpenCL Embedded profile for mobile and embedded silicon
- Relaxes some data type and precision requirements
- Avoids the need for a separate "ES" specification

• Khronos APIs provide computing support for imaging & graphics
- Enabling advanced applications in, e.g., Augmented Reality

• OpenCL will enable parallel computing in new markets
- Mobile phones, cars, avionics

Example: a camera phone with GPS processes images to recognize buildings and landmarks and provides relevant data from the internet.

- Page 12

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples
- Vector addition: the basic platform layer
- Matrix multiplication: writing simple kernels
- Optimizing matrix multiplication: work groups and the memory model
- Reduction: synchronization

• A survey of the rest of OpenCL

- Page 13

OpenCL Platform Model

• One Host + one or more Compute Devices
- Each Compute Device is composed of one or more Compute Units
- Each Compute Unit is further divided into one or more Processing Elements

- Page 14

Execution model:

• OpenCL execution model … define a problem domain and execute a kernel invocation for each point in the domain.
- E.g., process a 1024 x 1024 image. Global problem dimensions: 1024 x 1024 = 1 kernel execution per pixel; 1,048,576 total kernel executions.

Scalar:

void scalar_mul(int n,
                const float *a,
                const float *b,
                float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

Data parallel:

kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}   // execute over "n" work-items

- Page 15

An N-dimension domain of work-items

• Global Dimensions: 1024 x 1024 (whole problem space)

• Local Dimensions: 128 x 128 (work group … executes together)

[Figure: the 1024 x 1024 global domain divided into work-groups.]

Synchronization between work-items is possible only within work-groups: barriers and memory fences. You cannot synchronize outside of a work-group.

• Choose the dimensions that are “best” for your algorithm
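As a concrete host-side sketch (not on the original slide), the dimensions above map onto the size_t arrays passed to clEnqueueNDRangeKernel(). Note that the 128 x 128 local size is purely illustrative: real devices cap the work-group size (CL_DEVICE_MAX_WORK_GROUP_SIZE) far below 16,384 work-items, so query the device before choosing one.

// Launch a 2D NDRange over the 1024 x 1024 domain.
size_t global[2] = {1024, 1024};
size_t local[2]  = {128, 128};   // illustrative only; check the device limit
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             2,       // work dimensions
                             NULL,    // global offset
                             global, local,
                             0, NULL, NULL);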

- Page 16

Examples: Work-Items and Work-Groups

input (26 elements): 6 1 1 0 9 2 4 1 1 9 7 6 1 2 2 1 9 8 4 1 9 2 0 0 7 8

get_work_dim = 1
get_global_size = 26
get_num_groups = 2 work-groups
get_local_size = 13
get_group_id = 0 (for the first work-group)
get_local_id = 8, get_global_id = 21 (for an element in the second work-group)
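These values are tied together by a per-dimension identity: get_global_id() = get_group_id() * get_local_size() + get_local_id(). For the element highlighted above, the work-item in group 1 with local id 8 has global id 1 * 13 + 8 = 21.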

- Page 17

Kernel

• A data-parallel function executed for each work-item

kernel void square(global float* input,
                   global float* output)
{
    int i = get_global_id(0);
    output[i] = input[i] * input[i];
}

Input:  6  1 1 0  9 2  4 1 1  9  7  6 1 2 2 1  9  8  4 1  9 2 0 0  7  8
Output: 36 1 1 0 81 4 16 1 1 81 49 36 1 4 4 1 81 64 16 1 81 4 0 0 49 64

(get_global_id(0) = 10 picks out input 7, output 49.)

- Page 18

OpenCL Memory Model

• Private Memory - per work-item

• Local Memory - shared within a work-group

• Global/Constant Memory - visible to all work-groups

• Host Memory - on the CPU

[Figure: work-items with private memory inside work-groups with local memory, all sharing global/constant memory on the compute device; host memory sits on the host.]

• Memory management is explicit: you must move data from host -> global -> local and back.

- Page 19

Memory Consistency

• "OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times."

• Within a work-item: memory has load/store consistency

• Within a work-group: local memory is consistent between work-items at a barrier

• Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups

• Consistency of memory shared between commands (e.g. kernel invocations) is enforced through synchronization (events)

- Page 20

OpenCL C Language

• Derived from ISO C99
- No standard C99 headers, function pointers, recursion, variable length arrays, or bit fields

• Additions to the language for parallelism - Work-items and workgroups - Vector types - Synchronization

• Address space qualifiers

• Optimized image access

• Built-in functions

- Page 21

Data Types

• Scalar data types - char, uchar, short, ushort, int, uint, long, ulong, float, (double?) - bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)

• Image types - image2d_t, image3d_t, sampler_t

• Vector data types

- Page 22

Vector Types

• Portable

• Vector length of 2, 4, 8, and 16

• char2, ushort4, int8, float16, double2, …

• Endian safe

• Aligned at vector length

• Vector operations (elementwise) and built-in functions

- Page 23

Vector Operations

• Vector literals:

int4 vi0 = (int4) -7;            // vi0: -7 -7 -7 -7
int4 vi1 = (int4)(0, 1, 2, 3);   // vi1:  0  1  2  3

• Vector components:

vi0.lo = vi1.hi;                           // vi0: 2 3 -7 -7
int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);   // v8:  2 3 -7 -7 0 1 1 3

• Vector ops (elementwise):

vi0 += vi1;       // vi0: 2 4 -5 -4
vi0 = abs(vi0);   // vi0: 2 4  5  4

- Page 24

Contexts and Queues

• Contexts are used to contain and manage the state of the "world". Kernels are executed in contexts defined and manipulated by the host:
- Devices
- Kernels - OpenCL functions
- Program objects - kernel source and executable
- Memory objects

• Command-queue - coordinates execution of kernels
- Kernel execution commands
- Memory commands - transfer or mapping of memory object data
- Synchronization commands - constrain the order of commands

• Applications queue compute kernel execution instances
- Queued in-order
- Executed in-order or out-of-order
- Events are used to synchronize execution instances

- Page 25

[Figure: OpenCL summary — within a Context: Programs (the dp_mul kernel source compiled into CPU and GPU program binaries), Kernels (dp_mul with its arg[0..2] values set), Memory Objects (images and buffers), and Command Queues (in-order and out-of-order) feeding the compute devices (GPU and CPU).]

- Page 26

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples
- Vector addition: the basic platform layer
- Matrix multiplication: writing simple kernels
- Optimizing matrix multiplication: work groups and the memory model
- Reduction: synchronization

• A survey of the rest of OpenCL

- Page 27

Example: vector addition

• The "hello world" program of data-parallel programming is a program to add two vectors:

C[i] = A[i] + B[i] for i = 1 to N

• For the OpenCL solution, there are two parts
- Kernel code
- Host code

- Page 28

Vector Addition - Kernel

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
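One practical note (not on the original slide): if the global size must be rounded up above the vector length — for example, to a multiple of the work-group size — the kernel needs a guard against out-of-range ids. A minimal sketch, assuming n is passed as an extra argument:

// Hypothetical guarded variant: safe when the global size exceeds n.
__kernel void vec_add_n(__global const float *a,
                        __global const float *b,
                        __global float *c,
                        const int n)
{
    int gid = get_global_id(0);
    if (gid < n)                  // skip the padded work-items
        c[gid] = a[gid] + b[gid];
}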

- Page 29

Vector Addition - Host Program

// create the OpenCL context on a GPU device
cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                             NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the arg values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, NULL, 0, NULL, NULL);

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

- Page 30

Vector Addition - Host Program

(Same host program as the previous slide, annotated by stage:)

1. Define the platform and queues
2. Define memory objects
3. Create the program
4. Build the program
5. Create and set up the kernel
6. Execute the kernel
7. Read results back on the host

It's complicated, but most of this is "boilerplate" and not as bad as it looks.

- Page 31

Platform Layer: Basic discovery

• The platform layer allows applications to query for platform-specific features

• Querying devices
- clGetDeviceIDs()
  - Find out what compute devices are on the system
  - Device types include CPUs, GPUs, and Accelerators
- clGetDeviceInfo()
  - Queries the capabilities of the discovered compute devices, such as:
    - Number of compute cores
    - Maximum work-item and work-group size
    - Sizes of the different memory spaces
    - Maximum memory object size
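A minimal sketch of this discovery flow (assuming the usual CL/cl.h, stdio.h, and stdlib.h headers; the first-platform-only choice and the 1024-byte name buffer are arbitrary simplifications):

// Enumerate GPU devices on the first platform and print basic capabilities.
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_uint num_devices;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);

cl_device_id *devs = malloc(num_devices * sizeof(cl_device_id));
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devs, NULL);

for (cl_uint d = 0; d < num_devices; d++) {
    char name[1024];
    size_t max_wg;
    clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(devs[d], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg), &max_wg, NULL);
    printf("%s: max work-group size %zu\n", name, max_wg);
}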

- Page 32

Platform Layer: Contexts

• Creating contexts
- Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices
- Contexts are associated with one or more devices
  - Multiple contexts could be associated with the same device
- clCreateContext() and clCreateContextFromType() return a handle to the created context

- Page 33

Platform layer: Command-Queues

• Command-queues store a set of operations to perform

• Command-queues are associated with a context

• Multiple command-queues can be created to handle independent commands that don’t require synchronization

• Execution of the command-queue is guaranteed to be completed at sync points

[Figure: a Context containing a GPU and a CPU, each with its own Queue.]
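Two of those sync points can be requested explicitly from the host (a minimal sketch; error handling omitted):

clFlush(cmd_queue);    // issue all queued commands to the device; returns immediately
clFinish(cmd_queue);   // block the host until every queued command has completed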

- Page 34

VecAdd: Context, Devices, Queue

// create the OpenCL context on a GPU device
cl_context context = clCreateContextFromType(
                         0,                   // platform ID
                         CL_DEVICE_TYPE_GPU,  // ask for a GPU
                         NULL,                // error callback
                         NULL,                // user data for callback
                         NULL);               // error code

// get the list of GPU devices associated with the context
size_t cb;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cl_command_queue cmd_queue = clCreateCommandQueue(context,
                                 devices[0],  // use the first GPU device
                                 0,           // default options
                                 NULL);       // error code
- Page 35

Memory Objects

• Buffers
- Simple chunks of memory
- Kernels can access them however they like (arrays, pointers, structs)
- Kernels can read and write buffers

• Images
- Opaque 2D or 3D formatted data structures
- Kernels access them only via read_image() and write_image()
- Each image can be read or written in a kernel, but not both

- Page 36

Creating Memory Objects

• Memory objects are created within an associated context
- clCreateBuffer(), clCreateImage2D(), and clCreateImage3D()

• Memory can be created as read-only, write-only, or read-write

• Where objects are created in the platform memory space can be controlled
- Device memory
- Device memory with data copied from a host pointer
- Host memory
- Host memory associated with a pointer
  - Memory at that pointer is guaranteed to be valid at synchronization points
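For instance, a minimal sketch of the "host memory associated with a pointer" case uses CL_MEM_USE_HOST_PTR, as opposed to the CL_MEM_COPY_HOST_PTR flag in the VecAdd example (host_array and n are illustrative names):

// Buffer that uses (rather than copies) host memory; the implementation
// may cache it on the device, and the pointer's contents are only
// guaranteed to be up to date at synchronization points.
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            sizeof(cl_float) * n,
                            host_array,   // existing host allocation
                            &err);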

- Page 37

VecAdd: Create Memory Objects

cl_mem memobjs[3];

// allocate input buffer memory objects
memobjs[0] = clCreateBuffer(context,
                 CL_MEM_READ_ONLY |    // flags
                 CL_MEM_COPY_HOST_PTR,
                 sizeof(cl_float)*n,   // size
                 srcA,                 // host pointer
                 NULL);                // error code

memobjs[1] = clCreateBuffer(context,
                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                 sizeof(cl_float)*n, srcB, NULL);

// allocate output buffer memory object
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                 sizeof(cl_float)*n, NULL, NULL);

- Page 38

Build the Program Object

• The program object encapsulates:
- A context
- The program source/binary
- A list of target devices and build options

• The build process … to create a program object
- clCreateProgramWithSource()
- clCreateProgramWithBinary()

Kernel code:

kernel void horizontal_reflect(read_only image2d_t src,
                               write_only image2d_t dst)
{
    int x = get_global_id(0);   // x-coord
    int y = get_global_id(1);   // y-coord
    int width = get_image_width(src);
    float4 src_val = read_imagef(src, sampler, (int2)(width-1-x, y));
    write_imagef(dst, (int2)(x, y), src_val);
}

[Figure: the kernel source is compiled once for the GPU and once for the CPU, producing GPU code and CPU code.]
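When a build fails, the compiler's log can be retrieved with clGetProgramBuildInfo(). A minimal sketch (the fixed 16 KB buffer is an arbitrary simplification; production code would query the log size first):

// Fetch and print the build log for the first device after a failed build.
if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
    char log[16384];
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                          sizeof(log), log, NULL);
    printf("build log:\n%s\n", log);
}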

- Page 39

VecAdd: Create and Build the Program

// create the program
cl_program program = clCreateProgramWithSource(
                         context,
                         1,                // string count
                         &program_source,  // program strings
                         NULL,             // string lengths
                         NULL);            // error code

// build the program
cl_int err = clBuildProgram(program,
                 0,      // num devices in the device list
                 NULL,   // device list
                 NULL,   // options
                 NULL,   // notifier callback function ptr
                 NULL);  // user data for callback function
- Page 40

Kernel Objects

• Kernel objects encapsulate
- A specific kernel function declared in a program
- The argument values used for kernel execution

• Creating kernel objects
- clCreateKernel() - creates a kernel object for a single function in a program

• Setting arguments
- clSetKernelArg(<kernel>, <argument index>, <size>, <pointer to data>)
- Each argument must be set for the kernel function
- Argument values are copied and stored in the kernel object

• Kernel vs. program objects
- Kernels are related to program execution
- Programs are related to program source

- Page 41

VecAdd: Create the Kernel and Set the Arguments

// create the kernel
cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

// set "a" vector argument
err = clSetKernelArg(kernel,
          0,                     // argument index
          sizeof(cl_mem),        // argument data size
          (void *)&memobjs[0]);  // argument data

// set "b" vector argument
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);

// set "c" vector argument
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

- Page 42

Kernel Execution

• A command to execute a kernel must be enqueued to the command-queue
- The command-queue can be explicitly flushed to the device
- Command-queues execute in-order or out-of-order
  - In-order: commands complete in the order queued and memory is consistent
  - Out-of-order: no guarantee of (1) when commands are executed or (2) whether memory is consistent … unless specific synchronization is used

• clEnqueueNDRangeKernel()
- Data-parallel execution model
- Describes the index space for kernel execution
- Requires information on NDRange dimensions and work-group size

• clEnqueueTask()
- Task-parallel execution model (multiple queued tasks)
- Kernel is executed on a single work-item

• clEnqueueNativeKernel()
- Task-parallel execution model
- Executes a native C/C++ function not compiled using the OpenCL compiler
- This mode does not use a kernel object, so arguments must be passed in
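For comparison with the NDRange case on the next slide, a minimal sketch of the single-work-item form (equivalent to an NDRange with work_dim 1 and a global size of 1):

// Enqueue one instance of the kernel as a task (one work-item).
err = clEnqueueTask(cmd_queue, kernel,
                    0,      // no events to wait on
                    NULL,   // event wait list
                    NULL);  // event for this command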

- Page 43

VecAdd: Invoke Kernel

// set work-item dimensions
size_t global_work_size[1] = {n};

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
          1,                 // work dimensions
          NULL,              // must be NULL (global work offset)
          global_work_size,
          NULL,              // automatic (runtime-chosen) local work size
          0,                 // no events to wait on
          NULL,              // event wait list
          NULL);             // event for this kernel

- Page 44

Synchronization

• Synchronization: signals when commands are completed, to the host or to other commands in the queue

- Blocking calls
  - Commands that do not return until complete
  - clEnqueueReadBuffer() can be called as blocking and will block until complete

- Event objects
  - Track the execution status of a command
  - Some commands can be blocked until event objects signal completion of a previous command
  - clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer) is complete

- Queue barriers - queued commands that can block command execution
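A minimal sketch of the event case just described (buffer and array names reuse the VecAdd example; the event name is illustrative):

// Make the kernel launch wait until a non-blocking write completes.
cl_event write_done;
err = clEnqueueWriteBuffer(cmd_queue, memobjs[0], CL_FALSE, 0,
                           n * sizeof(cl_float), srcA,
                           0, NULL, &write_done);

err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, NULL,
                             1, &write_done,   // wait on the write
                             NULL);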

- Page 45

VecAdd: Read Output

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2],
          CL_TRUE,             // blocking
          0,                   // offset
          n*sizeof(cl_float),  // size
          dst,                 // pointer
          0, NULL, NULL);      // events

- Page 46

OpenCL C for Compute Kernels

• Derived from ISO C99
- A few restrictions: recursion, function pointers, functions in C99 standard headers ...
- Preprocessing directives defined by C99 are supported

• Built-in data types
- Scalar and vector data types, pointers
- Data-type conversion functions: convert_type<_sat><_roundingmode>
- Image types: image2d_t, image3d_t and sampler_t

• Built-in functions — required
- Work-item functions, math.h, read and write image
- Relational, geometric, and synchronization functions

• Built-in functions — optional
- Double precision, atomics to global and local memory
- Selection of rounding mode, writes to image3d_t surfaces

- Page 47

OpenCL C Language Highlights

• Function qualifiers
- The "__kernel" qualifier declares a function as a kernel
- Kernels can call other kernel functions

• Address space qualifiers
- __global, __local, __constant, __private
- Pointer kernel arguments must be declared with an address space qualifier

• Work-item functions
- Query work-item identifiers: get_work_dim(), get_global_id(), get_local_id(), get_group_id()

• Synchronization functions (see the sketch below)
- Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
- Memory fences - provide ordering between memory operations
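A minimal sketch of a barrier in use (an illustrative kernel, not from the original deck): each work-group stages data in local memory, and the barrier guarantees every store is visible before any work-item reads a neighbour's element.

// Each work-item reads its neighbour's value via local memory.
__kernel void shift_left(__global const float *in,
                         __global float *out,
                         __local float *scratch)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    scratch[lid] = in[gid];               // stage into local memory
    barrier(CLK_LOCAL_MEM_FENCE);         // all stores visible past here

    out[gid] = scratch[(lid + 1) % lsz];  // safe to read a neighbour
}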

- Page 48

OpenCL C Language Restrictions

• Pointers to functions are not allowed

• Pointers to pointers are allowed within a kernel, but not as kernel arguments

• Bit-fields are not supported

• Variable length arrays and structures are not supported

• Recursion is not supported

• Writes through a pointer to a type smaller than 32 bits are not supported

• Double types are not supported, but reserved

- Page 49

Vector Addition Kernel

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

- Page 50

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples
- Vector addition: the basic platform layer
- Matrix multiplication: writing simple kernels
- Optimizing matrix multiplication: work groups and the memory model
- Radix Sort

• A survey of the rest of OpenCL

- Page 51

Linear Algebra

• Definition: the branch of mathematics concerned with the study of vectors, vector spaces, linear transformations, and systems of linear equations.

• Example: consider the following system of linear equations

x + 2y + z = 1
x + 3y + 3z = 2
x + y + 4z = 6

- This system can be represented in terms of vectors and a matrix as the classic "Ax = b" problem:

| 1 2 1 | | x |   | 1 |
| 1 3 3 | | y | = | 2 |
| 1 1 4 | | z |   | 6 |

- Page 52

Solving Ax=b

• LU Decomposition: transform a matrix into the product of a lower triangular and an upper triangular matrix. It is used to solve a linear system of equations.

       L            U            A
| 1  0  0 |   | 1  2  1 |   | 1  2  1 |
| 1  1  0 | * | 0  1  2 | = | 1  3  3 |
| 1 -1  1 |   | 0  0  5 |   | 1  1  4 |

• Solving for x, given a problem Ax=b:

Ax=b  →  LUx=b  →  Ux=(L^-1)b  →  x=(U^-1)(L^-1)b
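Carrying the example through (a worked check, not on the original slide): forward-substitute Lc = b, then back-substitute Ux = c:

Lc = b:  c1 = 1;  c2 = 2 - c1 = 1;  c3 = 6 - c1 + c2 = 6
Ux = c:  5z = 6 → z = 6/5;  y + 2z = 1 → y = -7/5;  x + 2y + z = 1 → x = 13/5

Check against row 1 of the original system: 13/5 + 2·(-7/5) + 6/5 = 5/5 = 1 ✓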

- Page 53

Linear transformations

• A matrix A ∈ R^(M×P) multiplies a vector x ∈ R^P to define a linear transformation of that vector into R^M.

• Matrix multiplication defines the composition of two such linear transformations.

Compute C = A*B where C ∈ R^(M×N), A ∈ R^(M×P), B ∈ R^(P×N)

Matrix multiplication is the core building block of Linear Algebra

- Page 54

Matrix Multiplication: Sequential code

void mat_mul(int Mdim, int Ndim, int Pdim,
             float *A, float *B, float *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            // C(i,j) = sum(over k) A(i,k) * B(k,j)
            for (k = 0; k < Pdim; k++)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
        }
    }
}

C(i,j) = C(i,j) + A(i,:) * B(:,j) — a dot product of a row of A and a column of B for each element of C.

- Page 55

Matrix Multiplication Performance

• Results on an Apple laptop with an Intel CPU.

Case                              MFLOPS
CPU: Sequential C (not OpenCL)    167

(Device: Intel® Core™2 Duo CPU T8300 @ 2.40GHz. 3rd party names are the property of their owners.)

- Page 56

Matrix Multiplication: OpenCL kernel (1/4)

void mat_mul(int Mdim, int Ndim, int Pdim,
             float *A, float *B, float *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            // C(i,j) = sum(over k) A(i,k) * B(k,j)
            for (k = 0; k < Pdim; k++)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
        }
    }
}

- Page 57

Matrix Multiplication: OpenCL kernel (2/4)

Mark it as a kernel function and add memory qualifiers:

__kernel void mat_mul(
    const int Mdim, const int Ndim, const int Pdim,
    __global float *A, __global float *B, __global float *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            // C(i,j) = sum(over k) A(i,k) * B(k,j)
            for (k = 0; k < Pdim; k++)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
        }
    }
}

- Page 58

Matrix Multiplication: OpenCL kernel (3/4)

__kernel void mat_mul(
    const int Mdim, const int Ndim, const int Pdim,
    __global float *A, __global float *B, __global float *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {      // to be removed
        for (j = 0; j < Mdim; j++) {  // to be removed
            // C(i,j) = sum(over k) A(i,k) * B(k,j)
            for (k = 0; k < Pdim; k++)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
        }
    }
}

Remove the outer loops and set the work-item coordinates instead:

i = get_global_id(0);
j = get_global_id(1);

- Page 59

Matrix Multiplication: OpenCL kernel (4/4)

__kernel void mat_mul(
    const int Mdim, const int Ndim, const int Pdim,
    __global float *A, __global float *B, __global float *C)
{
    int i, j, k;
    i = get_global_id(0);
    j = get_global_id(1);
    // C(i,j) = sum(over k) A(i,k) * B(k,j)
    for (k = 0; k < Pdim; k++)
        C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
}

- Page 60

Matrix Multiplication: OpenCL kernel

Rearrange a bit and use a private (per-work-item) scalar for the intermediate C element values — a common optimization in matrix multiplication functions:

__kernel void mmul(
    const int Mdim,
    const int Ndim,
    const int Pdim,
    __global float* A,
    __global float* B,
    __global float* C)
{
    int k;
    int i = get_global_id(0);
    int j = get_global_id(1);
    float tmp = 0.0f;
    for (k = 0; k < Pdim; k++)
        tmp += A[i*Ndim+k] * B[k*Pdim+j];
    C[i*Ndim+j] = tmp;
}

- Page 61

Matrix Multiplication host program

#include "mult.h"

int main(int argc, char **argv)
{
    float *A, *B, *C;
    int Mdim, Ndim, Pdim;
    int err, szA, szB, szC;
    size_t global[DIM];
    size_t local[DIM];
    cl_device_id device_id;
    cl_context context;
    cl_command_queue commands;
    cl_program program;
    cl_kernel kernel;
    cl_uint nd;
    cl_mem a_in, b_in, c_out;

    Ndim = ORDER; Pdim = ORDER; Mdim = ORDER;
    szA = Ndim*Pdim; szB = Pdim*Mdim; szC = Ndim*Mdim;
    A = (float *)malloc(szA*sizeof(float));
    B = (float *)malloc(szB*sizeof(float));
    C = (float *)malloc(szC*sizeof(float));
    initmat(Mdim, Ndim, Pdim, A, B, C);

    err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
    commands = clCreateCommandQueue(context, device_id, 0, &err);

    a_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * szA, NULL, NULL);
    b_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * szB, NULL, NULL);
    c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * szC, NULL, NULL);
    err = clEnqueueWriteBuffer(commands, a_in, CL_TRUE, 0, sizeof(float) * szA, A, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(commands, b_in, CL_TRUE, 0, sizeof(float) * szB, B, 0, NULL, NULL);

    program = clCreateProgramWithSource(context, 1, (const char **)&C_elem_KernelSource, NULL, &err);
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "mmul", &err);

    err  = clSetKernelArg(kernel, 0, sizeof(int), &Mdim);
    err |= clSetKernelArg(kernel, 1, sizeof(int), &Ndim);
    err |= clSetKernelArg(kernel, 2, sizeof(int), &Pdim);
    err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &a_in);
    err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &b_in);
    err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &c_out);

    global[0] = (size_t) Ndim; global[1] = (size_t) Mdim; nd = 2;
    err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, NULL, 0, NULL, NULL);
    clFinish(commands);
    err = clEnqueueReadBuffer(commands, c_out, CL_TRUE, 0, sizeof(float) * szC, C, 0, NULL, NULL);
    test_results(A, B, C);
}

- Page 62

Matrix Multiplication host program

(Same host program as the previous slide.)

Note: this isn't as bad as you might think. It is almost the same as the host code we wrote for vector add. It's "boilerplate" … you get it right once and just re-use it.

- Page 63

Matrix Multiplication host program

(Same host program as before, annotated by stage:)

1. Declare and initialize data
2. Set up the platform
3. Set up buffers and write the A and B matrices to device memory
4. Build the program, define the kernel, and set up the arguments
5. Run the kernel and collect the results

- Page 64

Matrix Multiplication host program

(Same host program as before.) The only parts new to us:

1. A 2D ND Range set to the dimensions of the C matrix
2. Local sizes set to NULL in clEnqueueNDRangeKernel() to tell the system to pick the local dimensions for us

global[0] = (size_t) Ndim; global[1] = (size_t) Mdim; nd = 2;
err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, NULL, 0, NULL, NULL);

- Page 65

Matrix Multiplication Performance

• Results on an Apple laptop with an NVIDIA GPU and an Intel CPU. Matrices are stored in global memory.

Case                                     MFLOPS
CPU: Sequential C (not OpenCL)           167
GPU: C(i,j) per work-item, all global    511
CPU: C(i,j) per work-item, all global    744

(GPU device: NVIDIA GeForce® 8600M GT, max 4 compute units. CPU device: Intel® Core™2 Duo CPU T8300 @ 2.40GHz. 3rd party names are the property of their owners.)

- Page 66

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples
- Vector addition: the basic platform layer
- Matrix multiplication: writing simple kernels
- Optimizing matrix multiplication: work groups and the memory model
- Reduction: synchronization

• A survey of the rest of OpenCL

- Page 67

Optimizing Matrix Multiplication

• Cost is determined by flops and memory movement: 2n^3 = O(n^3) flops operate on 3n^2 numbers.

• To optimize matrix multiplication, we must ensure that for every memory movement we execute as many flops as possible.

• Outer-product algorithms are faster, but for pedagogical reasons let's stick to the simple dot-product algorithm: C(i,j) = C(i,j) + A(i,:) * B(:,j), a dot product of a row of A and a column of B for each element of C.

• We will work with work-item/work-group sizes and the memory model to optimize matrix multiplication.
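To make that ratio concrete (a back-of-the-envelope calculation, not on the original slide): for the ORDER = 1000 matrices used here, that is 2·10^9 flops over 3·10^6 values, or roughly 667 flops per number touched. The optimizations that follow try to approach that ratio by reusing each value many times once it has been moved.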

- Page 68

An N-dimension domain of work-items

• Global Dimensions: 1024 x 1024 (whole problem space)

• Local Dimensions: 128 x 128 (work group … executes together)

[Figure: the 1024 x 1024 global domain divided into work-groups.]

Synchronization between work-items is possible only within work-groups: barriers and memory fences. You cannot synchronize outside of a work-group.

• Choose the dimensions that are “best” for your algorithm

- Page 69

OpenCL Memory Model

• Private Memory - per work-item

• Local Memory - shared within a work-group

• Global/Constant Memory - visible to all work-groups

• Host Memory - on the CPU

[Figure: work-items with private memory inside work-groups with local memory, all sharing global/constant memory on the compute device; host memory sits on the host.]

• Memory management is explicit: you must move data from host -> global -> local and back.

- Page 70

Optimizing Matrix Multiplication

• There may be significant overhead to manage work-items and work-groups.

• Let's have each work-item compute a full row of C (still a dot product of a row of A and a column of B for each element of C).

- Page 71

An N-dimension domain of work-items

• Global Dimensions: 1000 x 1000 (whole problem space)

• Local Dimensions: 250 x 1000 (one work-group per compute unit)

[Figure: the 1000 x 1000 domain split into four 250 x 1000 work-groups.]

- Page 72

Reduce work-item overhead … do one row of C per work-item:

__kernel void mmul(
    const int Mdim,
    const int Ndim,
    const int Pdim,
    __global float* A,
    __global float* B,
    __global float* C)
{
    int k, j;
    int i = get_global_id(0);
    float tmp;
    for (j = 0; j < Mdim; j++) {
        tmp = 0.0f;
        for (k = 0; k < Pdim; k++)
            tmp += A[i*Ndim+k] * B[k*Pdim+j];
        C[i*Ndim+j] = tmp;
    }
}

- Page 73

MatMult host program: one row of C per work-item

(Same host program as before, except for the ND Range setup.) Changes to the host program:

1. The 1D ND Range is set to the number of rows in the C matrix
2. The local dimension is set to 250 so the number of work-groups matches the number of compute units (4 in this case) for our order-1000 matrices

global[0] = (size_t) Ndim; local[0] = (size_t) 250; nd = 1;
err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, local, 0, NULL, NULL);

- Page 74

Results: MFLOPS

Case                                      MFLOPS
CPU: Sequential C (not OpenCL)            167
GPU: C(i,j) per work item, all global     511
GPU: C row per work item, all global      258
CPU: C(i,j) per work item                 744

On its own, moving to one row of C per work-item didn't help (258 vs 511 MFLOPS).

• Results on an Apple laptop with an NVIDIA GPU and an Intel CPU.

Device is GeForce® 8600M GT GPU from NVIDIA with a max of 4 compute units

Device is Intel® Core™2 Duo CPU T8300 @ 2.40GHz

3rd party names are the property of their owners.

- Page 75

Optimizing Matrix Multiplication • Notice that each element of C in a row uses the same row of A.

• Let’s copy A into private memory so we don’t incur the overhead of pulling it from global memory for each C(i,j) computation.

C(i,j) = C(i,j) + A(i,:) * B(:,j), with A(i,:) held in the private memory of each work-item

- Page 76

Row of C per work item, A row private

__kernel void mmul(
    const int Mdim,
    const int Ndim,
    const int Pdim,
    __global float* A,
    __global float* B,
    __global float* C)
{
    int k, j;
    int i = get_global_id(0);
    float Awrk[1000];
    float tmp;

    /* Copy this work-item's row of A into private memory once. */
    for (k = 0; k < Pdim; k++)
        Awrk[k] = A[i*Ndim+k];

    for (j = 0; j < Mdim; j++) {
        tmp = 0.0f;
        for (k = 0; k < Pdim; k++)
            tmp += Awrk[k] * B[k*Pdim+j];
        C[i*Ndim+j] = tmp;
    }
}

Set up a work array for A in private memory and copy into it from global memory before we start with the matrix multiplications.

- Page 77

MatMult host program: one row of C per work-item

#include "mult.h"

int main(int argc, char **argv)

{

float *A, *B, *C;

int Mdim, Ndim, Pdim;

int err, szA, szB, szC;

size_t global[DIM];

size_t local[DIM];

cl_device_id device_id;

cl_context context;

cl_command_queue commands;

cl_program program;

cl_kernel kernel;

cl_uint nd;

cl_mem a_in, b_in, c_out;

Ndim = ORDER;

Pdim = ORDER;

Mdim = ORDER;

szA = Ndim*Pdim;

szB = Pdim*Mdim;

szC = Ndim*Mdim;

A = (float *)malloc(szA*sizeof(float));

B = (float *)malloc(szB*sizeof(float));

C = (float *)malloc(szC*sizeof(float));

initmat(Mdim, Ndim, Pdim, A, B, C);

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

commands = clCreateCommandQueue(context, device_id, 0, &err);

a_in = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * szA, NULL, NULL);

b_in = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * szB, NULL, NULL);

c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * szC, NULL, NULL);

err = clEnqueueWriteBuffer(commands, a_in, CL_TRUE, 0, sizeof(float) * szA, A, 0, NULL, NULL);

err = clEnqueueWriteBuffer(commands, b_in, CL_TRUE, 0, sizeof(float) * szB, B, 0, NULL, NULL);

*program = clCreateProgramWithSource(context, 1, (const char **) & C_elem_KernelSource, NULL, &err);

err = clBuildProgram(*program, 0, NULL, NULL, NULL, NULL);

*kernel = clCreateKernel(*program, "mmul", &err);

err = clSetKernelArg(*kernel, 0, sizeof(int), &Mdim);

err |= clSetKernelArg(*kernel, 1, sizeof(int), &Ndim);

err |= clSetKernelArg(*kernel, 2, sizeof(int), &Pdim);

err != clSetKernelArg(*kernel, 3, sizeof(cl_mem), &a_in);

err |= clSetKernelArg(*kernel, 4, sizeof(cl_mem), &b_in);

err |= clSetKernelArg(*kernel, 5, sizeof(cl_mem), &c_out);

global[0] = (size_t) Ndim; local[0] = (size_t) 250; *ndim = 1;

err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, local, 0, NULL, NULL);

clFinish(commands);

err = clEnqueueReadBuffer( commands, c_out, CL_TRUE, 0, sizeof(float) * szC, C, 0, NULL, NULL );

test_results(A, B, c_out); }

Changes to host program:

1. 1-D NDRange set to the number of rows in the C matrix

2. Local dimension set to 250 so the number of work-groups matches the number of compute units (4 in this case) for our order-1000 matrices

global[0] = (size_t) Ndim; local[0] = (size_t) 250; nd = 1;

err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, local, 0, NULL, NULL);

- Page 78

Matrix Multiplications Performance • Results on an Apple laptop with an NVIDIA GPU and an Intel CPU.

Device is GeForce® 8600M GT GPU from NVIDIA with a max of 4 compute units

Device is Intel® Core™2 Duo CPU T8300 @ 2.40GHz

3rd party names are the property of their owners.

Case                                      MFLOPS
CPU: Sequential C (not OpenCL)            167
GPU: C(i,j) per work item, all global     511
GPU: C row per work item, all global      258
GPU: C row per work item, A row private   873
CPU: C(i,j) per work item                 744

Copying the A row into private memory has a big impact (258 to 873 MFLOPS).

- Page 79

Optimizing Matrix Multiplication • Notice that each element of C uses the same row of A.

• Each work-item in a work-group uses the same columns of B

• Let’s store the B columns in local memory

C(i,j) = C(i,j) + A(i,:) * B(:,j), with A(i,:) in the private memory of each work-item and B(:,j) in the local memory of each work-group

- Page 80

Row of C per work item, A row private, B columns local

__kernel void mmul(
    const int Mdim,
    const int Ndim,
    const int Pdim,
    __global float* A,
    __global float* B,
    __global float* C,
    __local float* Bwrk)
{
    int k, j;
    int i = get_global_id(0);
    int iloc = get_local_id(0);
    int nloc = get_local_size(0);
    float Awrk[1000];
    float tmp;

    for (k = 0; k < Pdim; k++)
        Awrk[k] = A[i*Ndim+k];

    for (j = 0; j < Mdim; j++) {
        /* Work-items cooperatively copy column j of B into local memory. */
        for (k = iloc; k < Pdim; k += nloc)
            Bwrk[k] = B[k*Pdim+j];
        barrier(CLK_LOCAL_MEM_FENCE);

        tmp = 0.0f;
        for (k = 0; k < Pdim; k++)
            tmp += Awrk[k] * Bwrk[k];
        C[i*Ndim+j] = tmp;

        /* Everyone must finish reading Bwrk before the next iteration
           overwrites it. */
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Pass in a pointer to local memory. Work-items in a group start by copying the columns of B they need into the local memory.

- Page 81

MatMult host program: one small change from before

#include "mult.h"

int main(int argc, char **argv)

{

float *A, *B, *C;

int Mdim, Ndim, Pdim;

int err, szA, szB, szC;

size_t global[DIM];

size_t local[DIM];

cl_device_id device_id;

cl_context context;

cl_command_queue commands;

cl_program program;

cl_kernel kernel;

cl_uint nd;

cl_mem a_in, b_in, c_out;

Ndim = ORDER;

Pdim = ORDER;

Mdim = ORDER;

szA = Ndim*Pdim;

szB = Pdim*Mdim;

szC = Ndim*Mdim;

A = (float *)malloc(szA*sizeof(float));

B = (float *)malloc(szB*sizeof(float));

C = (float *)malloc(szC*sizeof(float));

initmat(Mdim, Ndim, Pdim, A, B, C);

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

commands = clCreateCommandQueue(context, device_id, 0, &err);

a_in = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * szA, NULL, NULL);

b_in = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * szB, NULL, NULL);

c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * szC, NULL, NULL);

err = clEnqueueWriteBuffer(commands, a_in, CL_TRUE, 0, sizeof(float) * szA, A, 0, NULL, NULL);

err = clEnqueueWriteBuffer(commands, b_in, CL_TRUE, 0, sizeof(float) * szB, B, 0, NULL, NULL);

*program = clCreateProgramWithSource(context, 1, (const char **) & C_elem_KernelSource, NULL, &err);

err = clBuildProgram(*program, 0, NULL, NULL, NULL, NULL);

*kernel = clCreateKernel(*program, "mmul", &err);

err = clSetKernelArg(*kernel, 0, sizeof(int), &Mdim);

err |= clSetKernelArg(*kernel, 1, sizeof(int), &Ndim);

err |= clSetKernelArg(*kernel, 2, sizeof(int), &Pdim);

err != clSetKernelArg(*kernel, 3, sizeof(cl_mem), &a_in);

err |= clSetKernelArg(*kernel, 4, sizeof(cl_mem), &b_in);

err |= clSetKernelArg(*kernel, 5, sizeof(cl_mem), &c_out);

err |= clSetKernelArg(*kernel, 6, sizeof(float)*Pdim, NULL);

global[0] = (size_t) Ndim; global[1] = (size_t) Mdim; *ndim = 2;

err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, local, 0, NULL, NULL);

clFinish(commands);

err = clEnqueueReadBuffer( commands, c_out, CL_TRUE, 0, sizeof(float) * szC, C, 0, NULL, NULL );

test_results(A, B, c_out); }

Changes to host program:

1. Pass local memory to the kernel. This adds one argument to the kernel argument list … a new call to clSetKernelArg is needed:

err |= clSetKernelArg(kernel, 6, sizeof(float)*Pdim, NULL);

- Page 82

Matrix Multiplications Performance • Results on an Apple laptop with an NVIDIA GPU and an Intel CPU.

Device is GeForce® 8600M GT GPU from NVIDIA with a max of 4 compute units

Device is Intel® Core™2 Duo CPU T8300 @ 2.40GHz

3rd party names are the property of their owners.

Case MFLOPS

CPU: Sequential C (not OpenCL) 167

GPU: C(i,j) per work item, all global 511

GPU: C row per work item, all global 258

GPU: C row per work item, A row private 873

GPU: C row per work item, A private, B local 2472

CPU: C(i,j) per work item 744

- Page 83

Matrix Multiplications Performance • Results on an Apple laptop with an NVIDIA GPU and an Intel CPU.

Device is GeForce® 8600M GT GPU from NVIDIA with a max of 4 compute units

Device is Intel® Core™2 Duo CPU T8300 @ 2.40GHz

3rd party names are the property of their owners.

Case Speedup

CPU: Sequential C (not OpenCL) 1

GPU: C(i,j) per work item, all global 3

GPU: C row per work item, all global 1.5

GPU: C row per work item, A row private 5.2

GPU: C row per work item, A private, B local 15

CPU: C(i,j) per work item 4.5

Wow!!! OpenCL on a GPU is radically faster than C on a CPU, right?

- Page 84

CPU vs GPU: Let's be fair

• We made no attempt to optimize the CPU/C code, but we worked hard to optimize the OpenCL/GPU code.

• Let's optimize the CPU code:
- Use compiler optimization (-O3).
- Replace float with double (CPU ALUs like double).
- Reorder loops:

void mat_mul_ijk(int Mdim, int Ndim, int Pdim,
                 double *A, double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++)
        for (j = 0; j < Mdim; j++)
            for (k = 0; k < Pdim; k++)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
}

void mat_mul_ikj(int Mdim, int Ndim, int Pdim,
                 double *A, double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++)
        for (k = 0; k < Pdim; k++)
            for (j = 0; j < Mdim; j++)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
}

- Float, no optimization: 167 MFLOPS
- Double, -O3, ijk order: 272 MFLOPS
- ikj order: 1130 MFLOPS
- kij order: 481 MFLOPS

- Page 85

Matrix Multiplications Performance • Results on an Apple laptop with an NVIDIA GPU and an Intel CPU.

Device is GeForce® 8600M GT GPU from NVIDIA with a max of 4 compute units

Device is Intel® Core™2 Duo CPU T8300 @ 2.40GHz

3rd party names are the property of their owners.

Case Speedup (relative to the optimized sequential C)

CPU: Sequential C (not OpenCL) 1

GPU: C(i,j) per work item, all global 0.45

GPU: C row per work item, all global 0.23

GPU: C row per work item, A row private 0.77

GPU: C row per work item, A private, B local 2.2

CPU: C(i,j) per work item 0.66

And we are still only using one core … and we are not using SSE, so there is lots of room to further optimize the CPU code.

- Page 86

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples - Vector addition:

- the basic platform layer - Matrix multiplication:

- writing simple kernels - Optimizing matrix multiplication:

- work groups and the memory model - Radix Sort:

- synchronization

• A survey of OpenCL 1.1

- Page 87

Original Radix Sort for the GPU by Mark Harris, Nvidia

Parallel work group optimized Radix Sort

Based on work by Lee Howes, AMD

- Page 88

• Sorts a list of fixed size integer keys

- Separates the key into individual digits

- Sorts digit-by-digit

• In this case we use a least significant digit sort

- Sorts first the least significant digit of the key, then the next and so on

- Provides a stable sort

• Radix sort is O(kn) where n is the number of keys and k the key length

- Grows linearly with the data set size, assuming constant memory performance

- It is not necessarily the fastest sort available

Principle of the radix sort
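To make the digit-by-digit idea concrete, here is a minimal sequential sketch in plain C (our own illustrative code with 4-bit digits, not the GPU implementation developed on the following slides):

#include <string.h>

/* Minimal sequential LSD radix sort over 4-bit digits (illustrative only).
   Each pass is a stable counting sort on the current digit. */
void lsd_radix_sort(unsigned *keys, unsigned *tmp, int n)
{
    for (int shift = 0; shift < 32; shift += 4) {
        int count[16] = {0};
        for (int i = 0; i < n; i++)               /* histogram the digit */
            count[(keys[i] >> shift) & 0xF]++;

        int offset[16];
        for (int b = 0, running = 0; b < 16; b++) {   /* exclusive prefix sum */
            offset[b] = running;
            running += count[b];
        }

        for (int i = 0; i < n; i++)               /* stable scatter */
            tmp[offset[(keys[i] >> shift) & 0xF]++] = keys[i];
        memcpy(keys, tmp, n * sizeof(unsigned));
    }
}

The GPU version that follows parallelizes exactly these three steps: the histogram, the prefix sum, and the scatter.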

- Page 89

•Take the least significant digit of the key

•Group the keys based on the value of that digit

- Use a counting sort to achieve this

- Maintain the ordering of keys within a value of the digit

•Repeat the process moving on to the next digit

Steps in computation

- Page 90

• We wish to sort the following set of keys in radix-2

- Each digit is 1 bit wide

- Sort the least significant bit first

Sorting a simple radix-2 key

First iteration, least significant digit (or bit):

Input keys:     3 1 1 0 3 2 2 1 1 0 0 2   (the original data)
Digit to sort:  1 1 1 0 1 0 0 1 1 0 0 0
Number of 1s:   0 1 2 3 3 4 4 4 5 6 6 6   (count the number of 1s and 0s in the set; 1s shown)
Keys by digit:  0 2 2 0 0 2 3 1 1 3 1 1   (sort the keys based on the first digit)

- Page 91

• We wish to sort the following set of keys in radix-2

- Each digit is 1 bit wide

- Sort the least significant bit first

Sorting a simple radix-2 key

Second iteration, second-least significant digit (or bit):

Input keys:     0 2 2 0 0 2 3 1 1 3 1 1   (the output of the previous iteration)
Digit to sort:  0 1 1 0 0 1 1 0 0 1 0 0
Number of 1s:   0 0 1 2 2 2 3 4 4 4 5 5   (count the number of 1s and 0s in the set; 1s shown)
Keys by digit:  0 0 0 1 1 1 1 2 2 2 3 3   (sort the keys based on the second digit)

- Page 92

• Sort the keys in radix-16

- 4 bit chunks on each sort pass

- Only 16 counting buckets – easily within the scope of an efficient counting sort

• Divide the set of keys into chunks

- Sort into 16 buckets in local memory

- More efficient global memory traffic scattering into only 16 address ranges

• Global prefix sum to obtain write locations

Implementing on the GPU

- Page 93

High level view

Dataset

Divide into blocks

Sort each block into 16 bins based on current digit

Compute global offsets for bins

Write partially sorted data into output

- Page 94

• Each block in a radix-16 sort is performed using four iterations of the binary sort

Sorting individual blocks

Take a single block of 512 elements

Perform 1-bit prefix sum of 1s (we know the number of 0s as location – number of 1s)

Re-order data into 0s and 1s

Repeat 4 times until we have our data sorted by the 4-bit digit

Compute counts in each bin, giving us a histogram for the block:

33 34 24 35 12 49 52 28 35 39 29 33 22 35 20 42

Store the sorted block and the histogram data.

- Page 95

Producing an efficient local prefix sum

• Each sorting pass requires a 1 bit prefix sum to be performed in local memory

• We use an efficient barrier-free local prefix sum for blocks 2x the wavefront size

- Take the block of data and load it into local memory: 1 1 1 0 1 0 0 1
- Write 0s into the earlier local memory locations: 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1
- Add [index] to [index – power of 2] with increasing powers of 2; the added 0s allow us to do this without conditionals
- The stride of 2 means that we can cover more elements with the wavefront and fix up at the end
- This can be completely barrier-free in a single wavefront; when we want to cross a wavefront boundary we must be more careful


- Page 98

Producing an efficient local prefix sum (continued)

In code, with idx pointing at the work-item's element past the zero padding:

if( groupThreadID < 64 )
{
    sorterSharedMemory[idx] += sorterSharedMemory[idx-1];
    sorterSharedMemory[idx] += sorterSharedMemory[idx-2];
    sorterSharedMemory[idx] += sorterSharedMemory[idx-4];
    sorterSharedMemory[idx] += sorterSharedMemory[idx-8];
    sorterSharedMemory[idx] += sorterSharedMemory[idx-16];
    sorterSharedMemory[idx] += sorterSharedMemory[idx-32];
    sorterSharedMemory[idx] += sorterSharedMemory[idx-64];
    sorterSharedMemory[idx-1] += sorterSharedMemory[idx-2];
}
barrier(CLK_LOCAL_MEM_FENCE);


- Page 100

Extending to 8x

• Vector register and efficient 128-bit memory loads

• Each WI using vector 4s

• Do internal 4-way prefix sum in vector first, then write into local memory

(Figure: perform the local prefix sum operation on the vector elements in-registers before continuing in local memory.)

// Do sum across vector in two stages

prefixSumData.y += prefixSumData.x;

prefixSumData.w += prefixSumData.z;

prefixSumData.z += prefixSumData.y;

prefixSumData.w += prefixSumData.y;

// Now just 128 values, each sum of a block of 4

sorterSharedMemory[groupThreadID] = 0;

sorterSharedMemory[groupThreadID+128] = prefixSumData.w;
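After the two in-register stages, prefixSumData holds the inclusive scan (x, x+y, x+y+z, x+y+z+w) of the four lanes, so the .w component written to local memory is the total for the work-item's block of 4.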

- Page 101

Computing a local histogram • A histogram can be implemented in multiple ways:

- Reduction

- Direct atomics

• The Evergreen architecture supports very efficient local atomics:

- Per-channel integer atomics

- Computed at the memory interface

• So we choose this approach

- Page 102

Computing a local histogram

As before, we are working on a vector4 per work-item to increase arithmetic density.

We first clear the histogram:

if( get_local_id(0) < (1<<BITS_PER_PASS) )
    histogram[addresses.x] = 0;

Then obtain the appropriate 4-bit chunk of the key:

sortedData.x >>= startBit;
sortedData.y >>= startBit;
sortedData.z >>= startBit;
sortedData.w >>= startBit;

int andValue = ((1<<BITS_PER_PASS)-1);
sortedData &= (uint4)(andValue, andValue, andValue, andValue);

- Page 103

Computing a local histogram

We barrier to allow the two wavefronts to view the cleared histogram, and then we run the atomic operations:

barrier(CLK_LOCAL_MEM_FENCE);
atomic_inc( &(histogram[sortedData.x]) );
atomic_inc( &(histogram[sortedData.y]) );
atomic_inc( &(histogram[sortedData.z]) );
atomic_inc( &(histogram[sortedData.w]) );

Finally we output the histogram for global summation:

if( get_local_id(0) < 16 ) {
    uint histValues;
    histValues = histogram[get_local_id(0)];
    unsigned globalOffset = 16*get_group_id(0);
    uint globalAddresses = get_local_id(0) + globalOffset;
    uint globalAddressesRadixMajor = numGroups;
    globalAddressesRadixMajor = globalAddressesRadixMajor * get_local_id(0) + get_group_id(0);
    histogramOutputGroupMajor[globalAddresses] = histValues;
    histogramOutputRadixMajor[globalAddressesRadixMajor] = histValues;
}


- Page 105

The global prefix sum • After we have prefix sums for each block we need the global versions

- Each work group needs to know, for each radix, where its output range starts

• For each group we had the histogram representing the number of each radix in the block:

• We need to perform a global prefix sum across these work groups to obtain global addresses:

Per-group histograms (counts of each radix in each block, from the figure):

33 34 24 35 12 49 52 28 35 39 29 33 22 35 20 42
20 18 68 45 11 40 50 31 54 50 12 30 27 31 17 8
10 18 58 65 11 25 45 26 54 55 17 40 32 20 30 6

Global prefix sum across the groups (partial view): each radix's global start,

0 73 143 293 438 472

and the starts for the 2nd and 3rd groups within each radix:

33 107 167 328 450
53 125 235 373 461

Radix 3 starts at location 328 for the 2nd (green) group.

- Page 106

• We have 16 local bins and 16 global bins now for the global sorting phase

• We need to perform a local prefix sum on the block’s histogram to obtain local offsets

Compute global sort

Block histogram:    33 34 24 35 12 49 52 28 35 39 29 33 22 35 20 42
Local prefix sum:   0 33 67 91 126 138 187 239 267 302 341 370 403 425 460 480

Destination of data computed as: the global sum for the radix puts us at the right base, the local sums put us within this range, and the index of the value in the block – from the local sum – tells us the exact location.
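A hedged sketch of the resulting address computation in pseudo-C (all names are hypothetical):

/* d:                the element's current 4-bit digit
   globalRadixStart: this group's global start for digit d,
                     from the global prefix sum across groups
   localRank:        the element's position among equal digits in the block,
                     from the block-local prefix sum */
unsigned dst = globalRadixStart[d] + localRank;
output[dst] = key;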

- Page 107

• Of course we compute this for each radix in parallel in the block, making the output highly efficient.

Compute global sort


- Page 108

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples - Vector addition:

- the basic platform layer - Matrix multiplication:

- writing simple kernels - Optimizing matrix multiplication:

- work groups and the memory model - Radix Sort:

- synchronization

• A survey of OpenCL 1.1

- Page 109

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples - Vector addition:

- the basic platform layer - Matrix multiplication:

- writing simple kernels - Optimizing matrix multiplication:

- work groups and the memory model - Radix Sort:

• A survey of OpenCL 1.1

- Page 110

A survey of OpenCL 1.1

- Page 111

OpenCL 1.1 - API

• Thread-safety

• All API calls, except clSetKernelArg, are thread safe

• Sub-buffer objects

• Create an object that represents a specific region in a buffer object

• Easy and efficient mechanism to distribute regions of a buffer object across multiple devices

• OpenCL™ synchronization mechanism ensures modifications to sub-buffer object reflected in appropriate region of parent buffer object
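A minimal sketch of carving out a sub-buffer (assumes an existing parent buffer and err; the region origin must respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN):

cl_buffer_region region;
region.origin = 0;                     /* byte offset into the parent buffer */
region.size   = 1024 * sizeof(float);  /* byte size of the region */
cl_mem sub = clCreateSubBuffer(buffer, CL_MEM_READ_ONLY,
                               CL_BUFFER_CREATE_TYPE_REGION, &region, &err);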

- Page 112

OpenCL 1.1 - API

• User Events

• clEnqueue*** commands can wait on event

• In OpenCL™ 1.0, events can only refer to OpenCL™ commands

• Need ability to enqueue commands that wait on an external, user defined, event

• Event CallBacks

• clSetEventCallback to register a user callback function

• called when command identified by event has completed

• Allows applications to enqueue new OpenCL™ commands based on event state changes in a non-blocking manner

• Lot’s more API stuff too

- Page 113

• Implicit Conversions

- OpenCL™ 1.0 requires widening for arithmetic operators

- OpenCL™ 1.1 extends this feature to all operators: relational, equality, bitwise, logical, ternary

OpenCL 1.1 - Language

float4 a, b;
float c;
b = a + c; // c is widened to a float4 vector first, and then the add is performed
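The same widening now applies, for example, to a relational operator; a small sketch:

float4 a;
float c;
int4 m = (a < c); // c is widened to float4; the vector compare yields -1 (true) or 0 per component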

- Page 114

• 3-component vector data types
- And everyone applauds ... well, almost everyone

• cl_khr_byte_addressable_store as a core feature

• Atomic extensions are now core features
- cl_khr_global_int32_{base | extended}_atomics
- cl_khr_local_int32_{base | extended}_atomics

OpenCL 1.1 - Language

- Page 115

• New built-in functions
- get_global_offset
- clamp for integer data types
- async_work_group_strided_copy: strided async copy of data between global and local memory
- shuffle: construct a permutation of elements from 1 or 2 input vectors and a mask

OpenCL 1.1 - Language
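For example, shuffle can reverse a vector with a constant mask; a small OpenCL C sketch:

float4 x    = (float4)(0.0f, 1.0f, 2.0f, 3.0f);
uint4  mask = (uint4)(3, 2, 1, 0);
float4 rev  = shuffle(x, mask);   /* (3.0f, 2.0f, 1.0f, 0.0f) */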

- Page 116

• Improve performance of OpenCL/OpenGL interoperability
- Portable OpenCL/OpenGL sharing requires:
  - a glFinish before clEnqueueAcquireGLObjects
  - a clFinish after clEnqueueReleaseGLObjects
- glFinish / clFinish are heavyweight APIs

OpenCL 1.1 – OpenCL/OpenGL Sharing

- Page 117

• Improve performance of OpenCL/OpenGL interoperability
- Create an OpenCL event from an OpenGL sync object
- Create an OpenGL sync object from an OpenCL event
- Allows for a finer-grained waiting mechanism:
  - Use the event_wait_list argument for events that refer to OpenGL commands to complete
  - Use OpenGL sync APIs to wait for specific OpenCL™ commands to complete

OpenCL 1.1 – OpenCL/OpenGL Sharing
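A hedged sketch using the cl_khr_gl_event extension (context, queue and gl_buffer assumed to exist):

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
cl_int err;
cl_event gl_done = clCreateEventFromGLsyncKHR(context, fence, &err);

/* The acquire waits on the GL fence instead of a full glFinish. */
err = clEnqueueAcquireGLObjects(queue, 1, &gl_buffer, 1, &gl_done, NULL);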

- Page 118

Agenda

• Heterogeneous computing and the origins of OpenCL

• OpenCL overview

• Exploring the spec through a series of examples - Vector addition:

- the basic platform layer - Matrix multiplication:

- writing simple kernels - Optimizing matrix multiplication:

- work groups and the memory model - Radix Sort:

• A survey of OpenCL 1.1

- Page 119

Conclusion

• OpenCL defines a platform-API/framework for heterogeneous computing … not just GPGPU or CPU-offload programming.

• OpenCL has the potential to deliver portably performant code, but only if it's used correctly:
- Implicit SIMD data-parallel code has the best chance of mapping onto a diverse range of hardware … once OpenCL implementation quality catches up with mature shader languages.

• The future is clear:
- Braided parallelism: mixing task-parallel and data-parallel code in a single program … balancing the load among ALL of the platform's available resources.