A Brief Introduction to OpenCL

• Reference: Programming Massively Parallel Processors: A Hands-on Approach, David Kirk and Wen-mei W. Hwu, Chapter 11
• What is OpenCL?
– It is a standardized, cross-platform, parallel-computing API
– It is designed to be portable & work with heterogeneous systems
– Unlike earlier models, such as OpenMP, OpenCL is designed to address complex memory hierarchies and SIMD computing
– Having a more general standard means that it is also more complex; not all devices may support all features and it may be necessary to write adaptable code
– In this brief introduction we will look at the data parallelism model and briefly see its application to the molecular visualization problem
Data Parallelism Model

• There is a direct correspondence with CUDA
• Host programs are used to launch kernels on OpenCL devices
• The index space maps data to the work items
• Work items are grouped into work groups, just as threads are grouped into blocks in CUDA
• Work items in the same group can be synchronized
• The next slide shows a 2D NDRange (or index space); it is very similar to the CUDA model (except that the work-group indices appear in the expected order!)
Parallel Execution Model
Getting global and local values

• Thread IDs and sizes
– The API calls and the equivalent CUDA code are shown below for dimension 0 (the x dimension)
– A parameter of 1 corresponds to the y dimension, and 2 to the z dimension
Device Architecture

• The CPU is a traditional computer that executes the host program
• Here is an OpenCL device
– It contains one or more compute units (CU)
– Each CU contains one or more processing elements (PE)
– There are a variety of memory types
Memory Characteristics

• Global memory – dynamically allocated by host, has read/write access by both host and devices
• Constant memory – dynamically allocated by host (unlike CUDA), read/write by host and read-only by device; a query returns the maximum size supported by the device
• Local memory – most closely corresponds to CUDA shared memory; can be dynamically allocated by the host and statically allocated by the device; cannot be accessed by the host (same as CUDA) but can be accessed by all work items in the work group
• Private memory – corresponds to CUDA local memory
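These four regions map onto the OpenCL C address-space qualifiers. The fragment below is only a sketch showing where each qualifier appears; the parameter names are made up, but `__global`, `__constant`, `__local`, and `__private` are the standard qualifiers:

```c
__kernel void sketch(__global float *out,       /* global: host-allocated, R/W by device   */
                     __constant float *coeffs,  /* constant: read-only on the device       */
                     __local float *scratch)    /* local: shared within one work group     */
{
    __private int i = get_global_id(0);         /* private: per work item (the default)    */
    /* ... */
}
```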
Kernel Functions

• Similarities with CUDA
– __kernel corresponds to __global__ in CUDA
– A vector add kernel is shown below; two input vectors a and b and one output vector result
– All three vectors reside in global memory; the two inputs are const
– This is a 1D problem so get_global_id(0) is used to get the thread index
– The addition is performed as expected
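The kernel itself is on the slide image rather than in this transcript; a version matching the description above (the names are guesses) would look like:

```c
__kernel void vadd(__global const float *a,   /* input vector a            */
                   __global const float *b,   /* input vector b            */
                   __global float *result)    /* output vector             */
{
    int id = get_global_id(0);                /* 1D problem: dimension 0   */
    result[id] = a[id] + b[id];               /* the addition, as expected */
}
```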
Device Management & Kernel Launch

• Now for the “ugly” side of OpenCL
– CUDA, which deals with uniform devices from a single manufacturer, can hide the details of device setup and kernel launch
– This is not possible in OpenCL, which is designed for widely varied devices from many manufacturers
• An OpenCL context
– Use clCreateContext()
– Use clGetDeviceIDs() to find all devices
– Create a command queue for each device
– A sequence of function calls is made to insert the kernel code, with its execution parameters, into the queue
A “Simple” Example - 1

• Line by Line
– Set error code to success
– Call create context from type
• Include all devices (param 2)
• The last argument sets the error code
– Line 3 declares parmsz, the size of the memory buffer
– Line 4 is the first call to clGetContextInfo
• clctx from line 2 is the first param
• Param 4 is NULL since the size is not known
• There will be another call in line 6 where the missing information is supplied
A “Simple” Example - 2

• Line by Line
– Line 5 uses malloc to allocate the buffer and assigns its address to cldevs
– Call clGetContextInfo again in line 6
• The third param is set to parmsz
• The fourth param is set to cldevs
• The error code is returned
– Line 7 creates a command queue for the first device
• cldevs is treated as an array and the 2nd param is cldevs[0]
• This generates a command queue for the first device in the list returned by clGetContextInfo
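The seven lines being walked through are on the slide images, not in this transcript. Based on the description above (and the Kirk & Hwu text), they are approximately as follows; this is a reconstruction, not a verbatim copy of the slide:

```c
cl_int clerr = CL_SUCCESS;                                  /* 1: error code = success        */
cl_context clctx = clCreateContextFromType(0,               /* 2: create context from type;   */
        CL_DEVICE_TYPE_ALL, NULL, NULL, &clerr);            /*    all devices, err in last arg */
size_t parmsz;                                              /* 3: size of the device buffer   */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,         /* 4: first call: size unknown,   */
        0, NULL, &parmsz);                                  /*    so param 4 is NULL          */
cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);     /* 5: allocate the buffer         */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,         /* 6: second call supplies the    */
        parmsz, cldevs, NULL);                              /*    missing size and buffer     */
cl_command_queue clcmdq = clCreateCommandQueue(clctx,       /* 7: command queue for the       */
        cldevs[0], 0, &clerr);                              /*    first device in the list    */
```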
Electrostatic Potential Map in OpenCL

• Step 1: design the organization of the NDRange
– Threads are now work items; blocks are work groups
– Each work item calculates up to eight grid points
– Each work group has 64 to 256 work items
Mapping DCS NDRange to OpenCL Device

• The structure is the same; only the nomenclature is changed
Changes in Data Access Indexing

• The changes are relatively minor
– __global__ becomes __kernel
– Accesses to the .x and .y fields, and the associated arithmetic, are replaced by function calls specifying dimensions 0 and 1
The Inner Loop of the DCS Kernel

• The OpenCL code is shown
– The logic is basically the same
– rsqrtf() has been changed to native_rsqrt()
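The loop itself is on the slide image; a sketch of the accumulation it describes is shown below. The variable names, and the assumption that each atom is packed as a float4 with the charge in .w, are carried over from the CUDA version of the DCS kernel and should be treated as illustrative:

```c
for (int i = 0; i < numatoms; i++) {
    float dx = coorx - atominfo[i].x;          /* distance components to atom i */
    float dy = coory - atominfo[i].y;
    float dz = coorz - atominfo[i].z;
    energyval += atominfo[i].w *               /* charge / distance             */
                 native_rsqrt(dx*dx + dy*dy + dz*dz);  /* was rsqrtf() in CUDA  */
}
```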
Building an OpenCL Kernel

– Line 1 – declares the entire DCS kernel as a string
– Line 3 – delivers source code string to the OpenCL run time system
– Line 4 – sets up the compiler flags
– Line 5 – invokes the runtime compiler to build program
– Line 6 – handle to kernel that can be submitted to command queue
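Reconstructed from the line-by-line description above (variable and kernel names are guesses; the API calls are the standard ones), the sequence is roughly:

```c
const char *clenergysrc =                                      /* 1: entire kernel as a string */
    "__kernel void clenergy( /* ... */ ) { /* ... */ }";
cl_program clpgm = clCreateProgramWithSource(clctx, 1,         /* 3: deliver source string to  */
        &clenergysrc, NULL, &clerr);                           /*    the OpenCL runtime        */
char clcompileflags[256];
sprintf(clcompileflags, "-cl-mad-enable");                     /* 4: set up compiler flags     */
clerr = clBuildProgram(clpgm, 0, NULL,                         /* 5: invoke the runtime        */
        clcompileflags, NULL, NULL);                           /*    compiler                  */
cl_kernel clkern = clCreateKernel(clpgm, "clenergy", &clerr);  /* 6: kernel handle that can be */
                                                               /*    submitted to a queue      */
```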
Host Code for the Kernel Launch - 1
Host Code for the Kernel Launch - 2

– Lines 1 & 2: allocate memory for the energy grid and the atom data
– Lines 3-6: set up the arguments to be passed to the kernel
– Line 8: submits the DCS kernel for launch
– Lines 9-10: check for errors, if any
– Line 11: transfers the result data in the energy array back to host memory
– Lines 12-13: release the memory
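The launch code itself is on the slide images; a hedged reconstruction following the description (all variable names such as volmemsz, Gsz, and Bsz are guesses, while the API calls are the standard ones) looks roughly like:

```c
/* 1-2: device buffers for the energy grid and the atom data */
cl_mem devenergy = clCreateBuffer(clctx, CL_MEM_READ_WRITE,
        volmemsz, NULL, NULL);
cl_mem devatominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,
        MAXATOMS * sizeof(cl_float4), NULL, NULL);

/* 3-6: bind the kernel arguments */
clerr = clSetKernelArg(clkern, 0, sizeof(int), &numatoms);
clerr = clSetKernelArg(clkern, 1, sizeof(float), &gridspacing);
clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &devenergy);
clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &devatominfo);

/* 8: submit the DCS kernel for launch (2D NDRange) */
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL,
        Gsz, Bsz, 0, NULL, NULL);

/* 11: blocking read of the energy results back to the host */
clerr = clEnqueueReadBuffer(clcmdq, devenergy, CL_TRUE, 0,
        volmemsz, energy, 0, NULL, NULL);

/* 12-13: release device memory */
clReleaseMemObject(devenergy);
clReleaseMemObject(devatominfo);
```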