
Introduction to OpenCL*

Ohad Shacham

Intel Software and Services Group

Thanks to Elior Malul, Arik Narkis, and Doron Singer

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

Evolution of OpenCL*


Sequential Programs

void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main()
{
    // read input
    scalar_mul(…);
    return 0;
}




Multi-threaded Programs

void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main()
{
    // read input
    pthread_create(…, scalar_mul, …);  // a second thread multiplies one half
    scalar_mul(n/2, …);                // the main thread multiplies the other half
    pthread_join(…);
    return 0;
}
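The slide code is schematic. A minimal sketch of the same two-thread split with POSIX threads could look as follows. The names `mul_task`, `mul_worker` and `parallel_mul` are hypothetical, introduced only for this illustration, and the program is assumed to be built with pthread support (for example `-pthread`).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Arguments for one slice of the element-wise multiplication (hypothetical helper type). */
struct mul_task {
    int n;
    const float *a;
    const float *b;
    float *c;
};

static void scalar_mul(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* pthread entry point: unpacks the task and calls scalar_mul. */
static void *mul_worker(void *arg)
{
    const struct mul_task *t = arg;
    scalar_mul(t->n, t->a, t->b, t->c);
    return NULL;
}

/* Multiplies the first half on a new thread and the rest on the calling thread.
 * Returns 0 on success, -1 if the thread could not be created or joined. */
int parallel_mul(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    int half = n / 2;
    struct mul_task first = { half, a, b, c };
    pthread_t worker;

    if (pthread_create(&worker, NULL, mul_worker, &first) != 0)
        return -1;
    scalar_mul(n - half, a + half, b + half, c + half);
    return pthread_join(worker, NULL) == 0 ? 0 : -1;
}
```

Even in this small sketch the number of threads and the split point are hard-coded, so the program has to change when more cores become available.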



Problems – concurrent programs

• Writing concurrent programs is hard

• Concurrent algorithms

• Threads

• Work balancing
  – Need to update programs when adding new cores to the system

• Data races, livelocks, deadlocks
  – Solving bugs in concurrent programs is harder


Vector instruction utilization

#include <xmmintrin.h>  // SSE intrinsics

void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    // assumes n is a multiple of 4 and that a, b and c are 16-byte aligned
    for (i = 0; i < n; i += 4) {
        __m128 a_vec = _mm_load_ps(a + i);
        __m128 b_vec = _mm_load_ps(b + i);
        __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
        _mm_store_ps(c + i, c_vec);
    }
}

int main()
{
    // read input
    scalar_mul(…);
    return 0;
}
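The slide version only works when n is a multiple of 4 and the arrays are 16-byte aligned. A hedged sketch that drops both assumptions is shown below: it uses unaligned loads and stores, finishes the tail with a scalar loop, and falls back to the scalar loop entirely when SSE is not available. The name `scalar_mul_sse` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Element-wise c[i] = a[i] * b[i] for any n and any alignment. */
void scalar_mul_sse(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration, using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 a_vec = _mm_loadu_ps(a + i);
        __m128 b_vec = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(a_vec, b_vec));
    }
#endif
    /* Scalar loop for the remaining 0 to 3 elements (or for everything without SSE). */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}
```

The preprocessor guard and the tail loop are exactly the kind of vendor-specific bookkeeping that hand-written vector code accumulates.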



Problems – vector instructions usage

• Utilizing vector instructions is also not a trivial task

• Vendor-dependent code

• Usage is not future proof
  – New, more efficient instructions
  – Wider vector registers


GPGPU

GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications.

(www.gpgpu.org)

Photo taken from: http://folding.stanford.edu/English/FAQ-NVIDIA




GPUs utilization

• Many cores can be utilized for computation

• GPUs become programmable - GPGPU
  – CUDA*

• Problems
  – Each vendor has its own language
  – Requires tweaking to get performance
  – How can I run both on CPUs and GPUs?



What do we need?

• Heterogeneous
  – Automatically utilizes all available processing units
  – Portable

• High Performance
  – Utilize hardware characteristics

• Future Proof

• Abstract concurrency from the user




OpenCL* – heterogeneous computing


Diagram based on deck presented in OpenCL* BOF at SIGGRAPH 2010 by Neil Trevett, NVIDIA, OpenCL* Chair



OpenCL* in a nutshell

An OpenCL* application consists of two parts:

• A set of C APIs that allows compiling and running OpenCL* “kernels”

• Code that is executed on the device by the OpenCL* runtime



Data parallelism


A fundamental pattern in high-performance parallel algorithms

Applying same computation logic across multiple data elements

[Figure: a sequential loop (i = 0; C[i] = A[i] * B[i]; i = i + 1) contrasted with the same statement C[i] = A[i] * B[i] executed independently for i = 0, 1, 2, 3, ..., N-2, N-1.]



Data parallelism usage

Client machines
• Video transcoding and editing
• Pro image editing
• Facial recognition

Workstations
• CAD tools
• 3D data content creation

Servers
• Science and simulations
• Medical imaging
• Oil & Gas
• Finance (e.g., Black-Scholes)
• …




OpenCL* kernel example

void array_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

__kernel void array_mul(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
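One way to read the kernel: the loop disappears from the source and the runtime supplies the index. The plain-C sketch below imitates that relationship on the host. It is an illustration only, not how an OpenCL* runtime is implemented; `array_mul_item` and `run_array_mul` are hypothetical names, and the index that `get_global_id(0)` would return is passed in as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel body: `id` plays the role of get_global_id(0). */
static void array_mul_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Stand-in for launching the kernel over a 1-D range of global_size work items.
 * A real runtime is free to run these calls in any order, or in parallel. */
void run_array_mul(size_t global_size, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t id = 0; id < global_size; id++)
        array_mul_item(id, a, b, c);
}
```

Because no call depends on the result of another, the order of the calls does not affect the output, which is what makes the kernel data parallel.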



[Figure: the array_mul kernel drawn over the arrays a, b and c; the work item selected by get_global_id(0) reads one element of a and b and writes the matching element of c.]




Execution Model

[Figure: the global index space is divided into work groups, and each work group consists of work items.]
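The relationship between the three indices can be sketched in plain C. The sketch assumes a 1-D range, no global offset, and a global size that is a multiple of the work-group size; `item_ids` and `enumerate_items` are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* Indices one work item would observe (hypothetical record type). */
struct item_ids {
    size_t group_id;   /* what get_group_id(0) would return  */
    size_t local_id;   /* what get_local_id(0) would return  */
    size_t global_id;  /* what get_global_id(0) would return */
};

/* Enumerates all work items of a 1-D range split into equal work groups.
 * Writes global_size entries into out, indexed by global id. */
void enumerate_items(size_t global_size, size_t local_size, struct item_ids *out)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t g = 0; g < num_groups; g++) {
        for (size_t l = 0; l < local_size; l++) {
            size_t gid = g * local_size + l;  /* no global offset assumed */
            out[gid].group_id = g;
            out[gid].local_id = l;
            out[gid].global_id = gid;
        }
    }
}
```

With 8 work items in groups of 4, for example, the work item with global id 5 is item 1 of group 1.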



The OpenCL* model

• The OpenCL* runtime is invoked on the host CPU (using the OpenCL* API)
  – Choose target device(s) for parallel computation

• Data-parallel functions, called kernels, are compiled (on the host)
  – Compiled for specific target devices (CPU, GPU, etc.)

• Data chunks (called buffers) are moved across devices

• Kernel “commands” are queued for execution on target devices
  – Asynchronous execution



The OpenCL* C language

• Derived from ISO C99

• A few restrictions, e.g., no recursion and no function pointers

• Short vector types, e.g., float4, short2, int16

• Built-in functions: math (e.g., sin), geometric, common (e.g., min, clamp)



OpenCL* key features

Unified programming model for all devices
• Develop once, run everywhere

Designed for massive data-parallelism
• Implicitly takes care of threading and intrinsics for optimal performance



OpenCL* key features

Dynamic compilation model (Just In Time - JIT)
• Future proof, provided vendors update their implementations

Enables heterogeneous computing
• A clever application can use all resources of the platform simultaneously



Benefits to User

• Hardware abstraction
  – Write once, run everywhere
  – Cross devices, cross vendors

• Automatic parallelization

• Good tradeoff between development simplicity and performance

• Future-proof optimizations

• Open standard
  – Supported by many vendors



Benefits to Hardware Vendor

• Enables good hardware ‘time to market’

• Programming model enables good hardware utilization

• Applications are automatically portable and future proof
  – JIT compilation



OpenCL* Cons

• Low level – based on C99
  – No heap!
  – Lean framework

• Expert tool
  – In terms of correctness and performance

• OpenCL* is not performance portable
  – Tweaking is needed for each vendor
  – Future specs and implementations may require no tweaking?



Vector dot multiplication

void vectorDotMul(int* vecA, int* vecB, int size, int* result)
{
    *result = 0;
    for (int i = 0; i < size; ++i)
        *result += vecA[i] * vecB[i];
}



Single work item

[Figure: a single work item multiplies the eight element pairs (1 × 2 = 2) one after another and accumulates the running sum 2, 4, 6, 8, 10, 12, 14, 16.]



Vector dot multiplication in OpenCL*

__kernel void vectorDotMul(__global const int* vecA, __global const int* vecB,
                           int size, __global int* result)
{
    // only one work item does all the work
    if (get_global_id(0) == 0) {
        *result = 0;
        for (int i = 0; i < size; ++i)
            *result += vecA[i] * vecB[i];
    }
}



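Single work group

[Figure: four work items of one work group each multiply two element pairs (1 × 2 = 2) and accumulate a partial sum of 4; the partial sums are then added one after another: 4, 8, 12, 16.]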



__kernel void vectorDotMul(__global const int* vecA, __global const int* vecB,
                           int size, __global int* result)
{
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];
    int localSize = get_local_size(0);
    int work = size / localSize;   // assumes size is a multiple of localSize
    int start = id * work;
    int end = start + work;

    // Work item calculation
    partialSum[id] = 0;
    for (int j = start; j < end; ++j)
        partialSum[id] += vecA[j] * vecB[j];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduction (done by a single work item)
    if (id == 0) {
        *result = 0;
        for (int i = 0; i < localSize; ++i)
            *result += partialSum[i];
    }
}



Efficient reduction

[Figure: the four partial sums (4, 4, 4, 4) are combined pairwise in a tree: 4 + 4 = 8 and 4 + 4 = 8, then 8 + 8 = 16.]
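The slides show the efficient reduction only as a picture. A plain-C sketch of one common way to do it, a pairwise tree reduction, is given below. It assumes the number of partial sums is a power of two, and `tree_reduce` is a hypothetical name. Each pass of the outer loop corresponds to one step between barriers in a kernel, and the inner loop stands in for the work items that are active in that step.

```c
#include <assert.h>
#include <stddef.h>

/* Sums partial[0..n-1] in place with a pairwise tree; n must be a power of two.
 * Returns the total, which also ends up in partial[0]. */
int tree_reduce(int *partial, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* In a kernel, work items 0..stride-1 would do this in parallel. */
        for (size_t id = 0; id < stride; id++)
            partial[id] += partial[id + stride];
    }
    return partial[0];
}
```

For n partial sums this takes log2(n) steps instead of the n sequential additions performed by a single work item.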



Vectorization

• Processors provide vector units
  – SIMD on CPUs
  – Warp on GPUs

• Utilize to perform a few operations in parallel
  – Arithmetic operations
  – Binary operations
  – Memory operations



Loop vectorization

void mul(int size, int* a, int* b, int* c)
{
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}



Loop vectorization

void mul(int size, int* a, int* b, int* c)
{
    // unrolled by 4; assumes size is a multiple of 4
    for (int i = 0; i < size; i += 4) {
        c[i]   = a[i]   * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}



Loop vectorization

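#include <smmintrin.h>  // SSE4.1, needed for _mm_mullo_epi32

void mul(int size, int* a, int* b, int* c)
{
    // the data is int, so the integer intrinsics are used; assumes size is a multiple of 4
    for (int i = 0; i < size; i += 4) {
        __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i c_vec = _mm_mullo_epi32(a_vec, b_vec);
        _mm_storeu_si128((__m128i *)(c + i), c_vec);
    }
}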



Automatic loop vectorization

Is there a dependency between a, b, and c?
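Why the question matters can be shown with a small plain-C experiment. `mul_scalar` is the original loop; `mul_by4` is a hypothetical model of a 4-wide vector operation, which loads all four inputs before storing any result. When c overlaps b, the two no longer agree, so a compiler may only vectorize automatically if it can prove that the arrays do not overlap (in C99 the programmer can promise this with `restrict`).

```c
#include <assert.h>

/* Reference: one element at a time, in program order. */
void mul_scalar(int size, const int *a, const int *b, int *c)
{
    for (int i = 0; i < size; ++i)
        c[i] = a[i] * b[i];
}

/* Models a 4-wide vector operation: all four products are computed from the
 * values loaded at the start of the iteration, then all four are stored. */
void mul_by4(int size, const int *a, const int *b, int *c)
{
    assert(size % 4 == 0);
    for (int i = 0; i < size; i += 4) {
        int t[4];
        for (int k = 0; k < 4; ++k)
            t[k] = a[i + k] * b[i + k];
        for (int k = 0; k < 4; ++k)
            c[i + k] = t[k];
    }
}
```

Calling both functions with `c = b + 1` makes every scalar iteration read the value written by the previous one, while the 4-wide version reads the old values; with non-overlapping arrays the results are identical.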





[Figure: the arrays c and b overlap in memory.]

void mul(int size, int* a, int* b, int* c)
{
    for (int i = 0; i < size; i += 4) {
        c[i]   = a[i]   * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}



Automatic vectorization in OpenCL*

__kernel void mul(int size, __global int* a, __global int* b, __global int* c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}




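// the work items of one work group, executed as a loop
for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    c[id] = a[id] * b[id];
}

// the same loop, unrolled by 4
for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    c[id]   = a[id]   * b[id];
    c[id+1] = a[id+1] * b[id+1];
    c[id+2] = a[id+2] * b[id+2];
    c[id+3] = a[id+3] * b[id+3];
}

// the unrolled loop, vectorized (int data, so integer intrinsics; _mm_mullo_epi32 is SSE4.1)
for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + id));
    __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + id));
    __m128i c_vec = _mm_mullo_epi32(a_vec, b_vec);
    _mm_storeu_si128((__m128i *)(c + id), c_vec);
}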



Single work group

[Figure: each of the four work items multiplies two adjacent element pairs (1 × 2 = 2) and accumulates a partial sum of 4; the partial sums are then reduced pairwise: 8, 8, then 16.]



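Vectorizer friendly

[Figure: the same computation with the elements interleaved across work items, so that neighboring work items read neighboring elements; each work item still accumulates a partial sum of 4 and the reduction yields 8, 8, then 16.]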



__kernel void vectorDotMul(__global const int* vecA, __global const int* vecB,
                           int size, __global int* result)
{
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];
    int localSize = get_local_size(0);

    // Work item calculation: neighboring work items read neighboring elements
    partialSum[id] = 0;
    for (int j = id; j < size; j += localSize)
        partialSum[id] += vecA[j] * vecB[j];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduction (done by a single work item)
    if (id == 0) {
        *result = 0;
        for (int i = 0; i < localSize; ++i)
            *result += partialSum[i];
    }
}
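The two ways of dividing the elements among work items can be compared in plain C. In this sketch the outer loop stands in for the work items of one group, and the function names are hypothetical. In the blocked version each work item walks its own contiguous chunk; in the interleaved version work item id starts at element id and advances by the group size, so at every step the work items of the group touch consecutive addresses.

```c
#include <assert.h>

/* Blocked: work item `id` handles one contiguous chunk of `work` elements. */
int dot_blocked(const int *vecA, const int *vecB, int size, int localSize)
{
    assert(localSize > 0 && size % localSize == 0);
    int work = size / localSize;
    int result = 0;
    for (int id = 0; id < localSize; ++id) {        /* one pass per work item */
        int partial = 0;
        for (int j = id * work; j < (id + 1) * work; ++j)
            partial += vecA[j] * vecB[j];
        result += partial;
    }
    return result;
}

/* Interleaved: work item `id` handles elements id, id + localSize, ... */
int dot_interleaved(const int *vecA, const int *vecB, int size, int localSize)
{
    assert(localSize > 0);
    int result = 0;
    for (int id = 0; id < localSize; ++id) {        /* one pass per work item */
        int partial = 0;
        for (int j = id; j < size; j += localSize)
            partial += vecA[j] * vecB[j];
        result += partial;
    }
    return result;
}
```

Both versions compute the same dot product; only the memory access pattern differs, and that pattern is what decides whether the loads of adjacent work items can be packed into one vector load.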



Predication

__kernel void mul(int size, __global int* a, __global int* b, __global int* c)
{
    int id = get_global_id(0);
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}



Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}

How can we vectorize the loop?



Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    bool mask = (id > 6);
    int c1 = a[id] * b[id];   // computed for every id
    int c2 = a[id] + b[id];   // computed for every id

    c[id] = (mask) ? c1 : c2;
}
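A plain-C sketch of the same idea, with the select written as bit operations the way a vector blend works, is shown below. The function names are hypothetical. Both sides of the branch are computed for every element, and an all-ones or all-zeros mask picks one of them.

```c
#include <assert.h>

/* Branching version: each element takes one of two paths. */
void mul_or_add_branch(int start, int end, const int *a, const int *b, int *c)
{
    for (int id = start; id < end; ++id) {
        if (id > 6)
            c[id] = a[id] * b[id];
        else
            c[id] = a[id] + b[id];
    }
}

/* Predicated version: no branch in the loop body. */
void mul_or_add_select(int start, int end, const int *a, const int *b, int *c)
{
    assert(start <= end);
    for (int id = start; id < end; ++id) {
        int mask = -(id > 6);            /* -1 (all bits set) if id > 6, else 0 */
        int c1 = a[id] * b[id];
        int c2 = a[id] + b[id];
        c[id] = (c1 & mask) | (c2 & ~mask);
    }
}
```

The price of predication is that both c1 and c2 are always computed; it pays off when the two paths are short and the branch outcome differs within one vector.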



Predication

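for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128i idVec = // vector of consecutive ids
    __m128i mask = _mm_cmpgt_epi32(idVec, Vec6);   // Vec6: a vector with 6 in every lane
    __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + id));
    __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + id));

    __m128i c1_vec = _mm_mullo_epi32(a_vec, b_vec);   // SSE4.1
    __m128i c2_vec = _mm_add_epi32(a_vec, b_vec);
    // blendv takes the second operand where the mask is set: mask ? c1 : c2
    __m128i c3_vec = _mm_blendv_epi8(c2_vec, c1_vec, mask);   // SSE4.1

    _mm_storeu_si128((__m128i *)(c + id), c3_vec);
}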



General tweaking

• Consecutive memory accesses
  – SIMD, warp

• How can we vectorize with control flow?

• Can we somehow create efficient code with control flow (CF)?
  – Uniform CF
  – CF that diverges at SIMD-size granularity

• Enough work groups to utilize the machine



Architecture tweaking

CPU
• Locality
• No local memory (also slow in some GPUs)
• Enough compute for a work group
• Overcome thread creation overhead

GPU
• Use local memory
• Avoid bank conflicts



Conclusion

• OpenCL* is an open standard that lets developers:
  – Write the same code for any type of processor
  – Use all existing resources of a platform in their application

• Automatic parallelism

• OpenCL* applications are automatically portable and forward compatible

• OpenCL* is still an expert tool
  – OpenCL* is not performance portable
  – Tweaking for each vendor should be done




