

Page 1:

A Brief Introduction to OpenCL • Reference - Programming Massively Parallel Processors: A Hands-on Approach, David Kirk and Wen-mei W. Hwu, Chapter 11

• What is OpenCL?

– It is a standardized, cross-platform, parallel-computing API

– It is designed to be portable & work with heterogeneous systems

– Unlike earlier models, such as OpenMP, OpenCL is designed to address complex memory hierarchies and SIMD computing

– Having a more general standard means that it is also more complex; not all devices may support all features and it may be necessary to write adaptable code

– In this brief introduction we will look at the data parallelism model and briefly see its application to the molecular visualization problem

Page 2:

Data Parallelism Model • There is a direct correspondence with CUDA

• Host programs are used to launch kernels on OpenCL devices

• The index space maps data to the work items

• Work items are grouped into work groups, like blocks in CUDA

• Work items in the same group can be synchronized

• The next slide shows a 2D NDRange (or index space), it is very similar to the CUDA model (except the Work group indices are in the expected order!)

Page 3:

Parallel Execution Model

Page 4:

Getting global and local values • Thread IDs and Sizes

– The API calls and the equivalent CUDA code are shown below for dimension 0 (the x dimension)

– If the parameter is 1, it corresponds to the y dimension; a parameter of 2 corresponds to the z dimension

Page 5:

Device Architecture • The CPU is a traditional computer that executes the host program

• Here is an OpenCL device

– It contains one or more compute units (CU)

– Each CU contains one or more processing elements (PE)

– There are a variety of memory types

Page 6:

Memory Characteristics • Global memory – dynamically allocated by the host; has read/write access by both host and devices

• Constant memory – dynamically allocated by host (unlike CUDA), read/write by host and read-only by device; a query returns the maximum size supported by the device

• Local memory – most closely corresponds to CUDA shared memory; it can be dynamically allocated by the host and statically allocated by the device; it cannot be accessed by the host (same as CUDA) but can be accessed by all work items in the work group

• Private memory – corresponds to CUDA local memory
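The four memory types map onto OpenCL C address-space qualifiers in kernel code. A hypothetical kernel showing all four (the kernel name and logic are illustrative, not from the slides):

```c
/* illustrative kernel: the four OpenCL address spaces side by side */
__kernel void scale(__global float *out,       /* global: host-allocated, device r/w   */
                    __constant float *coeff,   /* constant: host-written, device r/o   */
                    __local float *scratch) {  /* local: per-work-group scratch space  */
    float tmp = coeff[0];                      /* private: ordinary per-item variables */
    scratch[get_local_id(0)] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);              /* sync items sharing local memory      */
    out[get_global_id(0)] = scratch[get_local_id(0)] * tmp;
}
```

This is device code only; it cannot run without an OpenCL platform, so it is shown here purely to make the qualifier-to-memory mapping concrete.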

Page 7:

Kernel Functions • Similarities with CUDA

– __kernel corresponds to __global__ in CUDA

– A vector add kernel is shown below; two input vectors a and b and one output vector result

– All three vectors reside in global memory, the two inputs are const

– This is a 1D problem so get_global_id(0) is used to get the thread index

– The addition is performed as expected
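The vector-add kernel itself was an image on the slide and is not in this transcript. A sketch matching the description above (vectors a, b, and result as named on the slide; the kernel name is an assumption) might be:

```c
/* 1D vector add: two const inputs and one output, all in global memory */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *result) {
    int id = get_global_id(0);   /* 1D problem: dimension 0 gives the thread index */
    result[id] = a[id] + b[id];
}
```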

Page 8:

Device Management & Kernel Launch • Now for the “ugly” side of OpenCL

– CUDA, which deals with uniform devices from a single manufacturer, hides the details of launching kernels

– This is not possible in OpenCL, which is designed for many widely varied devices from many manufacturers

• An OpenCL context

– Use clCreateContext()

– Use clGetDeviceIDs() to find all devices

– Create a command queue for each device

– A sequence of function calls is made to insert the kernel code with its execution parameters

Page 9:

A “Simple” Example - 1 • Line by Line

– Set error code to success

– Call create context from type

• Include all devices (param 2)

• The last argument sets the error code

– Line 3 declares parmsz, the size of the memory buffer

– Line 4 is the first call to clGetContextInfo

• clctx from line 2 is the first param

• Param 4 is NULL since the size is not known

• There will be another call in line 6 where the missing information is supplied

Page 10:

A “Simple” Example - 2 • Line by Line

– Line 5 uses malloc to assign to cldevs the address of the buffer

– Call clGetContextInfo again in line 6

• The third param is set to parmsz

• The fourth param is set to cldevs

• The error code is returned

– Line 7 creates a command queue for the first device

• cldevs is treated as an array and the 2nd param is cldevs[0]

• This generates a command queue for the first device in the list returned by clGetContextInfo
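The seven-line listing these two slides walk through was an image and is missing from the transcript. A reconstruction following the bullet points (identifiers clerr, clctx, parmsz, and cldevs as named on the slides; the device-type flag is an assumption) might look like:

```c
cl_int clerr = CL_SUCCESS;                               /* 1: set error code to success     */
cl_context clctx = clCreateContextFromType(0,            /* 2: create context from type;     */
        CL_DEVICE_TYPE_ALL, NULL, NULL, &clerr);         /*    all devices, &clerr last      */
size_t parmsz;                                           /* 3: size of the memory buffer     */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,      /* 4: first call; param 4 is NULL   */
        0, NULL, &parmsz);                               /*    since the size is not known   */
cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);  /* 5: buffer address into cldevs    */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,      /* 6: call again with parmsz and    */
        parmsz, cldevs, NULL);                           /*    cldevs supplied               */
cl_command_queue clcmdq = clCreateCommandQueue(clctx,    /* 7: command queue for the first   */
        cldevs[0], 0, &clerr);                           /*    device, cldevs[0]             */
```

This requires an OpenCL platform to run; it is shown only so the numbered commentary above has code to point at.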

Page 11:

Electrostatic Potential Map in OpenCL • Step 1: design the organization of NDRange

– Threads are now work items; blocks are work groups

– Each work item calculates up to eight grid points

– Each work group has 64 to 256 work items

Page 12:

Mapping DCS NDRange to OpenCL Device • The structure is the same; only the nomenclature is changed

Page 13:

Changes in Data Access Indexing • The changes are relatively minor

– __global__ becomes __kernel

– The accesses to the .x and .y fields and the index arithmetic are replaced by function calls specifying dimensions 0 and 1

Page 14:

The Inner Loop of the DCS kernel • The OpenCL code is shown

– The logic is basically the same

– rsqrtf() has been changed to native_rsqrt()
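The loop body itself is not in the transcript. A fragment of the kind described (the variable names atominfo, energy, and the coordinate names are illustrative, not from the slide) might be:

```c
/* hypothetical inner-loop fragment: contribution of atom j to one grid point */
float dx = coorx - atominfo[j].x;
float dy = coory - atominfo[j].y;
float dz = coorz - atominfo[j].z;
/* CUDA's rsqrtf() becomes OpenCL's native_rsqrt(): a fast, device-native
   reciprocal square root with implementation-defined precision */
energy += atominfo[j].w * native_rsqrt(dx*dx + dy*dy + dz*dz);
```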

Page 15:

Building an OpenCL kernel – Line 1 – declares the entire DCS kernel as a string

– Line 3 – delivers the source-code string to the OpenCL runtime system

– Line 4 – sets up the compiler flags

– Line 5 – invokes the runtime compiler to build the program

– Line 6 – obtains a handle to the kernel that can be submitted to a command queue
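The numbered listing is missing from the transcript. A reconstruction following the bullets (the kernel name clenergy and the compiler flag are assumptions; the context clctx is the one created earlier):

```c
/* 1: entire DCS kernel source held as one string (body elided here) */
const char *clenergysrc = "__kernel void clenergy( /* ... */ ) { /* ... */ }";
/* 3: deliver the source string to the OpenCL runtime */
cl_program clpgm = clCreateProgramWithSource(clctx, 1, &clenergysrc, NULL, &clerr);
/* 4: set up the compiler flags (flag shown is only an example) */
char clcompileflags[] = "-cl-fast-relaxed-math";
/* 5: invoke the runtime compiler to build the program */
clerr = clBuildProgram(clpgm, 0, NULL, clcompileflags, NULL, NULL);
/* 6: handle to the kernel, ready to submit to a command queue */
cl_kernel clkern = clCreateKernel(clpgm, "clenergy", &clerr);
```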

Page 16:

Host Code for the kernel Launch - 1

Page 17:

Host Code for the kernel Launch - 2 – Lines 1 & 2 : allocate memory for the energy grid and the atoms

– Lines 3 – 6 : set up the arguments to be passed to the kernel

– Line 8 : submits the DCS kernel for launch

– Lines 9 – 10 : check for errors, if any

– Line 11 : transfers the result data in the energy array back to host memory

– Lines 12 – 13 : release the memory
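The launch listing itself is missing from the transcript. A sketch following the numbered bullets (buffer names, sizes, and the two-argument setup are assumptions; a real DCS kernel takes more arguments, and clcmdq/clkern are the queue and kernel created earlier):

```c
/* 1-2: allocate device memory for the energy grid and the atoms */
cl_mem devenergy = clCreateBuffer(clctx, CL_MEM_READ_WRITE, grid_bytes, NULL, NULL);
cl_mem devatoms  = clCreateBuffer(clctx, CL_MEM_READ_ONLY,  atom_bytes, NULL, NULL);
/* 3-6: bind the kernel arguments */
clerr  = clSetKernelArg(clkern, 0, sizeof(cl_mem), &devenergy);
clerr |= clSetKernelArg(clkern, 1, sizeof(cl_mem), &devatoms);
/* 8: submit the DCS kernel; Gsz/Lsz hold the 2D global and local sizes */
clerr |= clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL, Gsz, Lsz, 0, NULL, NULL);
/* 9-10: check for errors, if any */
if (clerr != CL_SUCCESS) { /* report and bail out */ }
/* 11: blocking read of the energy results back into host memory */
clEnqueueReadBuffer(clcmdq, devenergy, CL_TRUE, 0, grid_bytes, energy, 0, NULL, NULL);
/* 12-13: release the device memory */
clReleaseMemObject(devenergy);
clReleaseMemObject(devatoms);
```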