introduction to opencl - tu wien

Introduction to OpenCL

Ezio Bartocci

Vienna University of Technology

Overview

•  Overview of OpenCL for NVIDIA GPUs

•  API and Languages

•  Sample codes walkthrough

•  OpenCL Information and Resources

OpenCL – Open Computing Language

•  OpenCL is an open, royalty-free standard; its kernel language is an extension of C

•  It is a framework for parallel programming of heterogeneous systems that combine GPUs, CPUs, FPGAs, DSPs and other processors, including embedded and mobile devices

•  It was initially introduced by Apple and is now supported by NVIDIA, Intel, AMD, IBM and other members of the OpenCL working group

•  Managed by Khronos Group

OpenCL versions and history (1)

OpenCL 1.0 (2008)

•  OpenCL 1.0 was released with Mac OS X Snow Leopard

OpenCL 1.1 (2010)

•  The Khronos Group added significant functionality for enhanced parallel programming flexibility and performance, including:

•  New data types including 3-component vectors and additional image formats;

•  Handling commands from multiple host threads and processing buffers across multiple devices;

•  Operations on regions of a buffer including read, write and copy of 1D, 2D, or 3D rectangular regions;

•  Enhanced use of events to drive and control command execution;

•  Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;

•  Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL versions and history (2)

OpenCL 1.2 (2011)

•  Most notable features include:

•  Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.

•  Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.

•  Enhanced image support: 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.

•  Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.

•  DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.

NVIDIA OpenCL Support

Operating systems

•  Windows (XP, Vista, 8), 32/64-bit

•  Linux (Ubuntu, RHEL, etc.), 32/64-bit

•  Mac OS X Snow Leopard

Supported compilers and IDEs

•  GCC for Linux

•  Visual Studio for Windows

Drivers and JIT compiler

•  These are usually provided with the GPU drivers (e.g. the CUDA drivers)

NVIDIA SDK

•  It contains example applications, the specification, the programming manual and the best practices guide.

OpenCL Language & API

Platform Layer API (called from the host)

•  An abstraction layer for diverse computational resources

•  Query, select and initialize compute devices

•  Create compute contexts and work-queues

Runtime API (called from the host)

•  Launch compute kernels

•  Set kernel execution configuration

•  Manage scheduling, compute, and memory resources

OpenCL Language

•  Write compute kernels that run on a compute device

•  C-based cross-platform programming interface

•  Subset of ISO C99 with language extensions

•  Includes a rich set of built-in functions

•  Can be compiled just in time (JIT) or offline

OpenCL Programming Model

NDRange – N-Dimensional Range

•  N can be 1, 2 or 3

•  It defines the global index space; one kernel instance runs at each point of that space
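To make the index space concrete, the following sketch enumerates a 2D NDRange in plain C on the host. The helpers `linear_index` and `ndrange2d_visit` are hypothetical names chosen for this illustration and are not part of the OpenCL API. On a real device each (x, y) point would be handled by its own kernel instance rather than by a loop iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flatten a 2D global ID into a row-major offset,
   a common way for a kernel to map its index onto a 1D buffer. */
size_t linear_index(size_t gx, size_t gy, size_t width)
{
    return gy * width + gx;
}

/* Hypothetical helper: visit every point of a width x height NDRange once
   and mark it in `visited` (one byte per point).
   Returns the number of kernel instances such a launch would create. */
size_t ndrange2d_visit(size_t width, size_t height, unsigned char *visited)
{
    assert(visited != NULL);
    size_t count = 0;
    for (size_t gy = 0; gy < height; ++gy) {
        for (size_t gx = 0; gx < width; ++gx) {
            visited[linear_index(gx, gy, width)] = 1;
            ++count;
        }
    }
    return count;
}
```

For a 4 x 3 range this visits 12 points, one per kernel instance.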

Work-item

•  A single kernel instance in the index space

•  Each work-item executes the same compute kernel, but on different data

•  Work-items have unique global IDs from the index space

•  It corresponds to the concept of a thread in CUDA

Work-group

•  Work-items are further grouped into work-groups

•  Each work-group has a unique work-group ID

•  Each work-item has a unique local ID within its work-group

•  It corresponds to the concept of a block of threads in CUDA
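The relation between the three IDs can be sketched in plain C for one dimension. This assumes a zero global offset and a global size that is a multiple of the work-group size; the type `work_item_ids` and the functions `decompose` and `num_groups` are hypothetical names for this illustration. In that setting, global ID = work-group ID x work-group size + local ID, which mirrors `blockIdx.x * blockDim.x + threadIdx.x` in CUDA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the IDs one work-item sees in one dimension. */
typedef struct {
    size_t global_id;   /* what get_global_id(0) would return */
    size_t group_id;    /* what get_group_id(0) would return  */
    size_t local_id;    /* what get_local_id(0) would return  */
} work_item_ids;

/* Split a global ID into work-group ID and local ID (zero offset assumed). */
work_item_ids decompose(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_ids ids;
    ids.global_id = global_id;
    ids.group_id  = global_id / local_size;
    ids.local_id  = global_id % local_size;
    return ids;
}

/* Number of work-groups in one dimension of the NDRange. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with a work-group size of 16, global ID 37 belongs to work-group 2 and has local ID 5.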

OpenCL Memory Model

[Figure: a compute device (e.g. a GPU) with compute units 1 to N. Each compute unit runs a work-group of work-items 1 to M, each with its own private memory, and has one local memory. All compute units share the global/constant memory data cache and the global memory in compute device memory.]

•  Private memory: read/write access, for one work-item only

•  Local memory: read/write access, for the entire work-group

•  Constant memory: read access, for the entire NDRange (all work-items, all work-groups)

•  Global memory: read/write access, for the entire NDRange (all work-items, all work-groups)

Basic Program Structure

Host program, platform layer

•  Query compute devices

•  Create contexts

Host program, runtime

•  Create memory objects associated with contexts

•  Compile and create kernel program objects

•  Issue commands to the command-queue

•  Synchronize commands

•  Clean up OpenCL resources

Compute kernel (runs on device), OpenCL language

•  C code with some restrictions and extensions
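As a minimal sketch of what such a compute kernel can look like, here is a vector addition in OpenCL C. The kernel name `vec_add` is an illustrative choice. The block guarded by `#ifndef __OPENCL_VERSION__` is a host-side shim added only for this illustration (an assumption, not part of OpenCL): it lets the same text compile as ordinary C and emulates a 1D NDRange with a loop, whereas a real program would build the kernel through the runtime API and enqueue it on a device.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (illustration only): an OpenCL C compiler provides
   these qualifiers and get_global_id() itself. */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t shim_global_id;   /* set by the emulation loop below */
size_t get_global_id(unsigned int dim) { (void)dim; return shim_global_id; }
#endif

/* Each work-item adds one pair of elements. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      unsigned int n)
{
    size_t i = get_global_id(0);
    if (i < n)                   /* guard: global size may be rounded up */
        c[i] = a[i] + b[i];
}

#ifndef __OPENCL_VERSION__
/* Emulates a 1D NDRange of `global_size` work-items, run one by one. */
void run_vec_add(const float *a, const float *b, float *c,
                 unsigned int n, size_t global_size)
{
    assert(global_size >= n);    /* every element needs a work-item */
    for (shim_global_id = 0; shim_global_id < global_size; ++shim_global_id)
        vec_add(a, b, c, n);
}
#endif
```

The bounds check matters because the global size is often rounded up to a multiple of the work-group size, so some work-items have no element to process.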

Buffer objects

•  1D collection of objects (like C arrays)

•  Scalar and vector types, and user-defined structures

•  Accessed via pointers in the compute kernel

Image objects

•  2D or 3D textures, frame-buffers, or images

•  Must be addressed through built-in functions

Sampler objects

•  Describe how to sample an image in the kernel: addressing modes and filtering modes

OpenCL Language Highlights

Function qualifiers

•  The "__kernel" qualifier declares a function as a kernel

Address space qualifiers

•  "__global", "__local", "__constant", "__private"

Work-item functions

•  get_work_dim()

•  get_global_id(), get_local_id(), get_group_id(), get_local_size()

Image functions

•  Images must be accessed through built-in functions

•  Image reads are performed through sampler objects, which are passed from the host or defined in the kernel source

Synchronization functions

•  Barriers: all work-items within a work-group must execute the barrier function before any work-item in the work-group can continue
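The barrier rule can be illustrated with a work-group sum. The sketch below is a sequential emulation in plain C, not device code: each loop over `lid` stands for one phase that every work-item of the group executes, and the end of that loop stands in for a call to `barrier(CLK_LOCAL_MEM_FENCE)` in a real kernel. `LOCAL_SIZE` and `group_sum` are hypothetical names, and the work-group size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE 8   /* hypothetical work-group size (power of two) */

/* Sequential emulation of one work-group summing LOCAL_SIZE values. */
float group_sum(const float *in)
{
    assert(in != NULL);
    float scratch[LOCAL_SIZE];   /* plays the role of __local memory */

    /* Phase 0: every work-item copies its element into local memory. */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        scratch[lid] = in[lid];
    /* <- barrier: all copies are complete before any addition starts */

    /* Tree reduction: the number of active work-items halves each step. */
    for (size_t stride = LOCAL_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < LOCAL_SIZE; ++lid) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
        }
        /* <- barrier: placed outside the `if`, so idle work-items
              reach it as well */
    }
    return scratch[0];           /* work-item 0 would write this result */
}
```

In a real kernel the barrier must sit outside the `if (lid < stride)` branch. If only some work-items of the group reached it, the others would never arrive and the behaviour would be undefined.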
