
Page 1: OpenCL Introduction A TECHNICAL REVIEW LU OCT. 11 2014

OpenCL Introduction

A TECHNICAL REVIEW
LU LU

OCT. 11 2014

Page 2

OPENCL INTRODUCTION | APRIL 11, 2014

CONTENTS

1. OpenCL Architecture

2. OpenCL Programming

3. A Matrix Multiplication Example

Page 3

1. OPENCL ARCHITECTURE

Page 4

1. OPENCL ARCHITECTURE

1. Four Architectural Models
– Platform Model
– Execution Model
– Memory Model
– Programming Model

2. OpenCL Framework

Page 5

1.1 FOUR ARCHITECTURAL MODELS

Platform Model

Execution Model

Memory Model

Programming Model

Page 6

1.1.1 PLATFORM MODEL

Page 7

1.1.1 PLATFORM MODEL (CONT.)

One host equipped with OpenCL device(s).

An OpenCL device consists of compute unit(s)/CU(s).

A CU consists of processing element(s), or PE(s).
– Computations on a device occur within the PEs.

Page 8

1.1.2 EXECUTION MODEL

Kernels
– execute on one or more OpenCL devices

Host Program
– executes on the host
– defines the context for the kernels
– manages the execution of the kernels

Page 9

1.1.2 EXECUTION MODEL (CONT.)

NDRange
– an N-dimensional index space, where N is 1, 2, or 3

WORK-ITEM
– an instance of the kernel
– identified by a global ID in the NDRange
– executes the same code in parallel
• The specific execution pathway through the code, and the data operated upon, can vary per work-item.

Page 10

1.1.2 EXECUTION MODEL (CONT.)

WORK-GROUP
– Provides a coarse-grained decomposition of the NDRange.
– Is assigned a unique work-group ID with the same dimensionality as the NDRange.
– Uses a unique local ID to identify each of its work-items.
– Its work-items execute concurrently on the PEs of a single CU.
– Kernels can use synchronization controls within a work-group.
– The NDRange size should be a multiple of the work-group size.

Page 11

1.1.2 EXECUTION MODEL (CONT.)

Page 12

1.1.2 EXECUTION MODEL (CONT.)

Context
– The host defines a context for the execution of the kernels.

Resources in the context:
– Devices
• The collection of OpenCL devices to be used by the host.
– Kernels
• The OpenCL functions that run on OpenCL devices.
– Program Objects
• The program source and executable that implement the kernels.
– Memory Objects
• A set of memory objects visible to the host and the OpenCL devices.
• Memory objects contain values that can be operated on by instances of a kernel.

Page 13

1.1.2 EXECUTION MODEL (CONT.)

Command-queue
– The host creates a data structure called a command-queue to coordinate execution of the kernels on the devices.
– The host places commands into the command-queue, which are then scheduled onto the devices within the context.
– The command-queue schedules commands for execution on a device.
– Commands execute asynchronously between the host and the device.

Page 14

1.1.2 EXECUTION MODEL (CONT.)

Commands in a command-queue:
– Kernel execution commands
• Execute a kernel on the processing elements of a device.
– Memory commands
• Transfer data to, from, or between memory objects, or map and unmap memory objects from the host address space.
– Synchronization commands
• Constrain the order of execution of commands.

Page 15

1.1.2 EXECUTION MODEL (CONT.)

Command execution modes:

– In-order Execution

– Out-of-order Execution
• Any order constraints are enforced by the programmer through explicit synchronization commands.

Page 16

1.1.3 MEMORY MODEL

Page 17

1.1.3 MEMORY MODEL (CONT.)

Private Memory
– Per work-item

Local Memory
– Shared within a work-group

Global/Constant Memory
– The latter is cached

Host Memory
– On the CPU

Memory management is explicit
– Data must be moved from host -> global -> local and back

Page 18

1.1.3 MEMORY MODEL (CONT.)

Memory Regions
– Allocation and memory access capabilities

Page 19

1.1.3 MEMORY MODEL (CONT.)

Memory Consistency

– OpenCL uses a relaxed consistency memory model; i.e., the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.

– Within a work-item, memory has load/store consistency.

– Within a work-group, at a barrier, local memory is consistent across work-items.

– Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups.

– Consistency of memory shared between commands is enforced through synchronization.

Page 20

1.1.4 PROGRAMMING MODEL

Data Parallel Programming Model
– All the work-items in the NDRange execute in parallel.

Task Parallel Programming Model
– A kernel executes on a compute unit with a work-group containing a single work-item.
– Parallelism is expressed by:
• using vector data types implemented by the device,
• enqueuing multiple tasks, and/or
• enqueuing native kernels developed using a programming model orthogonal to OpenCL.

Page 21

1.1.4 PROGRAMMING MODEL (CONT.)

Synchronization

– Work-items in a single work-group
• Work-group barrier

– Commands enqueued to command-queue(s) in a single context
• Command-queue barrier
• Waiting on an event
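The work-group barrier is the one intra-kernel synchronization primitive. A hypothetical OpenCL C kernel sketching the usual pattern (stage data into local memory, synchronize, then read neighbours' values; the kernel name and arguments are this sketch's choices, and the code requires an OpenCL device to run):

```c
/* OpenCL C device code: every work-item in a group must reach the
   barrier before any may continue past it, so after the barrier the
   whole scratch[] tile is visible to the whole work-group. */
__kernel void reverse_in_group(__global float *data,
                               __local  float *scratch)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    int gid = get_global_id(0);

    scratch[lid] = data[gid];           /* stage into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);       /* local memory consistent
                                           across the work-group   */
    data[gid] = scratch[lsz - 1 - lid]; /* safe to read neighbours */
}
```

Without the barrier, a work-item could read scratch slots that its neighbours have not written yet; the barrier is what gives local memory its consistency point.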

Page 22

1.1.4 PROGRAMMING MODEL (CONT.)

Event Synchronization

Page 23

1.2 OPENCL FRAMEWORK

OpenCL Platform Layer
– Allows a host program to discover OpenCL devices and their capabilities and to create contexts.

OpenCL Runtime
– Allows the host program to manipulate created contexts.

OpenCL Compiler
– Creates executable programs containing OpenCL kernels. The OpenCL programming language implemented by the compiler supports a subset of the ISO C99 language with parallelism extensions.

Page 24

2. OPENCL PROGRAMMING

Page 25

2.2 BASIC STEPS

Step 1: Discover and initialize the platforms

Step 2: Discover and initialize the devices

Step 3: Create the context

Step 4: Create a command queue

Step 5: Create device buffers

Step 6: Write the host data to device buffers

Page 26

2.2 BASIC STEPS (CONT.)

Step 7: Create and compile the program

Step 8: Create the kernel

Step 9: Set the kernel arguments

Step 10: Configure the work-item structure

Step 11: Enqueue the kernel for execution

Step 12: Read the output buffer back to the host

Step 13: Release the OpenCL resources
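Steps 1–13 can be sketched with the OpenCL 1.x host C API as follows. This is a minimal, hedged sketch: error handling is elided (every cl* call reports a cl_int status that real code must check), the kernel name "mmul", the buffer sizes, and the NDRange dimensions are placeholders, and the program needs an OpenCL runtime to build and run:

```c
#include <CL/cl.h>

int main(void) {
    cl_int err;

    /* Steps 1-2: discover a platform and a device */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* Steps 3-4: create the context and a command-queue */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* Steps 5-6: create device buffers and write host data to them */
    float hostA[64] = {0}, hostC[64];
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof hostA, NULL, &err);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof hostC, NULL, &err);
    clEnqueueWriteBuffer(q, bufA, CL_TRUE, 0, sizeof hostA, hostA, 0, NULL, NULL);

    /* Steps 7-8: build the program and create the kernel */
    const char *src = "...";  /* kernel source string, elided */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "mmul", &err);

    /* Step 9: set the kernel arguments */
    clSetKernelArg(k, 0, sizeof bufA, &bufA);
    clSetKernelArg(k, 1, sizeof bufC, &bufC);

    /* Steps 10-11: configure the NDRange and enqueue the kernel */
    size_t global[2] = {8, 8}, local[2] = {4, 4};
    clEnqueueNDRangeKernel(q, k, 2, NULL, global, local, 0, NULL, NULL);

    /* Step 12: read the output buffer back (blocking read) */
    clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, sizeof hostC, hostC, 0, NULL, NULL);

    /* Step 13: release the OpenCL resources */
    clReleaseMemObject(bufA); clReleaseMemObject(bufC);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```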

Page 27

2.3 BASIC STRUCTURE

Host program
– Query compute devices
– Create the context and command-queue
– Create memory objects associated with the context
– Compile and create kernel objects
– Issue commands to the command-queue
– Synchronize commands
– Release OpenCL resources

Kernels
– C code with some restrictions and extensions

[Figure labels: Platform Layer, Runtime, Language]

Page 28

3. AN EXAMPLE

Page 29

3.1 DESCRIPTION OF THE PROBLEM

A is an m × k matrix.

B is a k × n matrix.

They satisfy the dimension requirement for matrix multiplication.

Calculate C = A × B, which is an m × n matrix.

Page 30

3.2 SERIAL IMPLEMENTATION

Page 31

3.3 CALCULATION PROCEDURE DIAGRAM

[Figure: calculation procedure diagram for matrices A, B, and C]

Page 32

3.4 CHARACTERISTICS OF THE CALCULATION

Each element of C is calculated by the same computation, applied to different data from A and B.

The calculation of each element of C is independent of all the others.
– There are no write collisions.

So the problem is well suited to data-parallel computing.

Page 33

3.5 OPENCL IMPLEMENTATION

We assign one work-item to each element of C.

We write a kernel that calculates one element of C.

We use a 2-D NDRange the size of C.
– All the elements of C are generated concurrently.

Page 34

3.6 OPENCL MATRIX-MULTIPLY CODE

Kernel:
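A minimal matrix-multiply kernel of the kind described in 3.5, one work-item per element of C. This is a hedged sketch: row-major storage, the kernel name "mmul", and the argument order are this sketch's choices, and the code runs only on an OpenCL device:

```c
/* OpenCL C device code: the work-item at (col, row) in the 2-D
   NDRange computes the single element C[row][col] as a dot product
   of a row of A and a column of B. */
__kernel void mmul(const int M, const int N, const int K,
                   __global const float *A,  /* M x K, row-major */
                   __global const float *B,  /* K x N, row-major */
                   __global float *C)        /* M x N, row-major */
{
    int col = get_global_id(0);  /* 0 .. N-1 */
    int row = get_global_id(1);  /* 0 .. M-1 */
    if (row >= M || col >= N) return;  /* guard for padded NDRanges */

    float acc = 0.0f;
    for (int p = 0; p < K; ++p)
        acc += A[row * K + p] * B[p * N + col];
    C[row * N + col] = acc;
}
```

Compared with the serial version, the two outer loops have disappeared: the NDRange itself enumerates the (row, col) pairs, and only the inner dot-product loop remains in the kernel.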

Page 35

3.7 OPENCL IMPLEMENTATION

What the host must do is illustrated in the figure on the right.

Set the size of the NDRange (and work-group) when enqueuing the kernel.

The calculation of each element of C is then done in parallel.

[Figure: host-side flow across the framework layers (Platform layer, Runtime layer, Compiler) — Query platform, Query devices, Command queue, Create kernel, Compile program, Create buffers, Set arguments, Execute kernel]

Page 36

THANK YOU!

Page 37

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.