OpenCL Programming in Detail
N-Body Algorithm Tutorial
David Richie | November 2010
| OpenCL Programming in Detail | November 8, 2010
Agenda
Brief review of OpenCL™ concepts
Use of STDCL to simplify host code programming
Code walk-through: N-body algorithm tutorial
Hybrid CPU/GPU Architectures and OpenCL™
[Diagram: CPUs with shared host memory, connected over PCIe to multiple GPUs, each pair of GPUs with its own GPU memory]
Issues
– Distributed memory management
– Concurrency
– Platform/vendor portability of APIs
OpenCL provides ...
– Platform and runtime layer for managing concurrent execution of operations across multiple devices
– C language extension for programming devices such as CPUs/GPUs
– Platform/device independent API with broad industry support
Anatomy of OpenCL™
Language Specification
– Based on ISO C99 with added extensions and restrictions
Platform API
– Application routines to query system and setup OpenCL™ resources
Runtime API
– Manages kernel objects, memory objects, and the execution of kernels on OpenCL™ devices
OpenCL™ Architecture – Execution Model
Kernel:
– Basic unit of executable code that runs on OpenCL™ devices
– Data-parallel or task-parallel
Host program:
– Executes on the host system
– Sends kernels to execute on OpenCL™ devices using command queue
OpenCL™ C Language - Kernel Programming
Language based on ISO C99
– Some restrictions
Additions to language for parallelism
– Vector types
– Work-items/group functions
– Synchronization
Address Space Qualifiers
Built-in Functions
Kernels – Expressing Data-Parallelism
Define N-dimensional computation domain
– N = 1, 2, or 3
– Each element in the domain is called a work-item
– N-D domain (global dimensions) defines the total work-items that execute in parallel
– Each work-item executes the same kernel
Example:
– Processing a 1024x1024 image: the kernel would be executed 1,048,576 times over a 2D computational domain
Kernels: Work-item and Work-group
Work-items are grouped into work-groups
– Local dimensions define the size of the work-groups
– Execute together on same compute unit
– Share local memory and synchronization
[Diagram: a 32x32 global domain divided into work-groups]
Synchronization between work-items is possible only within a work-group; work-items in different work-groups cannot synchronize with each other.
Execution Model – Host Program
Create “Context” to manage OpenCL™ resources:
– Devices – OpenCL™ devices to execute kernels
– Program Objects - source or binary that implements kernel functions
– Kernels – the specific function to execute on the devices
– Memory Objects – memory buffers common to the host and devices supporting distributed memory management
Execution Model – Command Queue
Manage execution of kernels
Accepts:
– Kernel execution commands
– Memory commands
– Synchronization commands
Queued in-order
Execute in-order or out-of-order
Memory Model
Global – read and write by all work-items and work-groups
Constant – read-only by work-items; read and write by host
Local – used for data sharing; read/write by work-items in same work-group
Private – only accessible to one work-item
[Diagram: host memory on the host; global/constant memory on the compute device; each work-group has local memory shared by its work-items; each work-item has its own private memory]
Memory management is explicit
– Must move data from host to global to local and back
Synchronization
OpenCL™ is designed to support concurrent execution of kernels and memory transfers across multiple devices
Programmer is responsible for synchronization using events and/or blocking calls
– Events can be used to prevent one operation from executing until one or more operations have finished
– Programmer can explicitly block on one or more events on the host side
STDCL: A Simplified Interface to OpenCL™
STDCL
Idea:
– OpenCL™ provides explicit, platform/device-independent control over execution and data movement
– In practice this can be tedious; the syntax and semantics are verbose
– Provide simplified interface based on typical use cases in a familiar UNIX style
Here the host code will use STDCL for simplicity
– Allows focus on the concepts, which are complicated enough, without getting lost in low-level syntax
– No restrictions on the functionality provided by OpenCL
– Use of such an API is inevitable for any serious programming project
– Free, open-source, LGPLv3 license
STDCL (1/4)
Obtaining an OpenCL™ compute layer "context"
– OpenCL: (1) query platforms, (2) select platform, (3) get devices, (4) create contexts for each device, (5) create command queues for each device
– STDCL provides default contexts stddev, stdcpu, stdgpu, ... that are "ready to go"
#include "stdcl.h"
CONTEXT* stddev; // contains all devices
CONTEXT* stdcpu; // contains all CPU devices
CONTEXT* stdgpu; // contains all GPU devices
Link with -lstdcl
STDCL (2/4)
Managing OpenCL™ kernels
– OpenCL: (1) manage program text, (2) create program, (3) build program, (4) create kernel
– STDCL provides clopen, clsym, clclose
– Use:
#include "stdcl.h"
void* clh = clopen( stdgpu, "nbody_kern.cl", CLLD_NOW );
cl_kernel krn = clsym( stdgpu, clh, "nbody_kern", CLLD_NOW );
...
clclose( stdgpu, clh );
STDCL (3/4)
OpenCL™ memory management
– OpenCL: requires use of opaque memory buffers and enqueued read-buffer/write-buffer commands
– STDCL provides clmalloc, clfree for allocation of shareable memory across OpenCL devices
– Use:
#include "stdcl.h"
cl_float4* pos = (cl_float4*)clmalloc( stdgpu, np*sizeof(cl_float4), 0);
...
clfree(pos);
STDCL (4/4)
Managing execution of concurrent events
– OpenCL: requires enqueueing and managing events
– STDCL provides clmsync, clfork, clwait
– Use:
#include "stdcl.h"
clmsync( stdgpu, 0, pos, CL_MEM_DEVICE|CL_EVENT_NOWAIT );
...
clfork( stdgpu, 0, krn, &ndr, CL_EVENT_NOWAIT );
...
clmsync( stdgpu, 0, pos, CL_MEM_HOST|CL_EVENT_NOWAIT );
...
clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );
Code Walk-Through: N-Body Algorithm Tutorial
Code Walk-Through: N-Body Algorithm
Basic N-body algorithm
OpenCL program structure
Implementation:
– Kernel code
– Host code
Compilation
Two-GPU implementation
– Modified kernel code
– Modified host code
f_i = Σ_{j≠i} m_j (r_j − r_i) / |r_j − r_i|³
Basic N-Body Algorithm
Models motion of N particles subject to particle-particle interaction, e.g.,
– Gravitational force
– Charged particles
Computation is O(N²)
Algorithm has two main steps:
– Calculate total force on each particle
– Update particle position/velocity over some small time-step (Newtonian dynamics)
Entire (unoptimized) algorithm can be written in C with a few dozen lines of code
Basic N-Body Code (1/2)
// For each particle "i" ...
for (i = 0; i < n; i++) {
    ax = 0.0f;
    ay = 0.0f;
    az = 0.0f;

    // Loop over all particles "j" and accumulate force on particle "i"
    for (j = 0; j < n; j++) {
        dx = x[j] - x[i];
        dy = y[j] - y[i];
        dz = z[j] - z[i];
        invr = 1.0f / sqrt(dx * dx + dy * dy + dz * dz + eps);
        invr3 = invr * invr * invr;
        f = m[j] * invr3;
        ax += f * dx;
        ay += f * dy;
        az += f * dz;
    }
For each particle "i" accumulate all forces due to particle "j"
Basic N-Body Code (2/2)
    // update position and velocity of particle "i"
    x_new[i] = x[i] + dt * vx[i] + 0.5f * dt * dt * ax;
    y_new[i] = y[i] + dt * vy[i] + 0.5f * dt * dt * ay;
    z_new[i] = z[i] + dt * vz[i] + 0.5f * dt * dt * az;
    vx[i] += dt * ax;
    vy[i] += dt * ay;
    vz[i] += dt * az;
}

// copy updated positions back into original arrays
for (i = 0; i < n; i++) {
    x[i] = x_new[i];
    y[i] = y_new[i];
    z[i] = z_new[i];
}
Update positions/velocities using the acceleration (a=F/m)
Repeat
OpenCL Program Structure
OpenCL Program Structure
OpenCL implementation will consist of two parts:
Kernel code
– Compiled to run on the GPU, performs the actual computation
– Typically based on critical loops within larger program
– Intended to provide accelerated version of a given algorithm
Host code
– Performs no meaningful computation, but is still important
– Especially important for using multiple devices
– Initialization and bookkeeping tasks
– Coordinate operations on the OpenCL device(s), e.g.,
Memory management
Kernel execution
Implementation: Kernel Code
Kernel Code
Goal is to provide a reasonably standard implementation that is understandable
Attempts to use good practices from an OpenCL perspective; however, it is very likely not optimal for any particular architecture
Important to remember context of OpenCL kernel code:
– Kernel will be executed for every work-item (enumerated thread) within an index-space (range of enumerated threads)
– This application has a one-dimensional index-space with a number of work-items equal to the number of particles in the system
– Kernel code will be invoked once for each of the N particles
– Task for the kernel code is to update the position and velocity of one particle using Newtonian mechanics
Implementation: Kernel Code (1/5)
// nbody_kern.cl
__kernel void nbody_kern(
    float dt1, float eps,
    __global float4* pos_old,
    __global float4* pos_new,
    __global float4* vel,
    __local float4* pblock
) {
Prototype for kernel
Very similar to a function prototype, with a few exceptions
– Must be given the qualifier __kernel
– Pointer arguments must be qualified to reflect correct address space, e.g., __global, __local, etc.
Note: kernel code is placed in a separate file with the extension ".cl" to distinguish it from ordinary C
Implementation: Kernel Code (2/5)
const float4 dt = (float4)( dt1, dt1, dt1, 0.0f );

int gti = get_global_id(0);  // relative to global index-space
int ti  = get_local_id(0);   // relative to local work-group

int n  = get_global_size(0); // global index-space
int nt = get_local_size(0);  // local work-group
int nb = n/nt;
Size/index determination and other bookkeeping
Built-in functions allow each work-item (thread) to self-identify its role in the parallel execution of the kernel over the index-space
Example assuming N = 8192 particles:
– Typical values for Cypress GPU would be n=8192, nt=64, nb=128
– Therefore, 0≤gti<8192 and 0≤ti<64
Implementation: Kernel Code (3/5)
float4 p = pos_old[gti];
float4 v = vel[gti];

float4 a = (float4)( 0.0f, 0.0f, 0.0f, 0.0f );

// For each block ...
for (int jb = 0; jb < nb; jb++) {

    // Cache ONE particle position
    pblock[ti] = pos_old[jb * nt + ti];

    // Wait for others in the work-group
    barrier(CLK_LOCAL_MEM_FENCE);
Using local memory as a cache for blocking
Work-items (threads) in work-group perform cooperative read to fill cache
– Each thread copies one value from global memory to local memory and expects the other threads to do the same
– barrier used for synchronization
Implementation: Kernel Code (4/5)
    // For each cached particle position ...
    for (int j = 0; j < nt; j++) {

        // Accumulate force/acceleration
        float4 p2 = pblock[j];
        float4 d = p2 - p;
        float invr = rsqrt(d.x * d.x + d.y * d.y + d.z * d.z + eps);
        float f = p2.w * invr * invr * invr;
        a += f * d;
    }

    // Wait for others in work-group
    barrier(CLK_LOCAL_MEM_FENCE);
}
Perform force calculation using blocking
Inner loop is over the block size where particle positions were cached
barrier is required to prevent overwriting the cache until all work-items (threads) are done
Implementation: Kernel Code (5/5)
p += dt * v + 0.5f * dt * dt * a;
v += dt * a;

pos_new[gti] = p;
vel[gti] = v;
}
Position and velocity update
Note: we are not updating the original particle position array, but instead are using a double-buffering scheme
Implementation: Host Code
Host Code Implementation (1/8)
// nbody.c (one GPU)
#include <stdcl.h>

void nbody_init( int n, cl_float4* pos, cl_float4* vel );
void nbody_output( int n, cl_float4* pos, cl_float4* vel );

int main(int argc, char** argv) {

    int step, burst;

    int nparticle = 8192;  // Must be power of two for simplicity
    int nstep = 100;
    int nburst = 20;       // Must divide nstep without remainder
    int nthread = 64;      // chosen for ATI Radeon HD 5870

    float dt = 0.0001f;
    float eps = 0.0001f;
Initialization of the program
Note: including stdcl.h includes CL/cl.h automatically
Host Code Implementation (2/8)
    size_t nparticle_sz = nparticle * sizeof(cl_float4);

    cl_float4* pos1 = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0 );
    cl_float4* pos2 = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0 );
    cl_float4* vel  = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0 );
Contexts and memory allocation
Notes:
– stdgpu is a CONTEXT* provided by stdcl.h that includes all devices of type GPU with everything ready to use
No set-up of platform, devices, contexts, command queues is required
– Memory allocated with clmalloc is accessible to OpenCL devices
No need to create opaque OpenCL “buffers”
Programmer can manage memory using conventional semantics
Host Code Implementation (3/8)
    nbody_init( nparticle, pos1, vel );

    void* clh = clopen( stdgpu, "nbody_kern.cl", CLLD_NOW );
    cl_kernel krn = clsym( stdgpu, clh, "nbody_kern", CLLD_NOW );
Initialize particle positions and velocities
Load and compile OpenCL kernels
Notes:
– clopen and clsym manage the OpenCL kernels
Modeled after traditional UNIX dynamic loader dlopen and dlsym
No need to manage program text, create program, build program, create kernel
Your kernel is ready to use with only two calls
Host Code Implementation (4/8)
    clndrange_t ndr = clndrange_init1d( 0, nparticle, nthread );

    clarg_set( krn, 0, dt );
    clarg_set( krn, 1, eps );
    clarg_set_global( krn, 4, vel );
    clarg_set_local( krn, 5, nthread * sizeof(cl_float4) );
Set up computational domain and kernel arguments
Notes:
– The N-dimensional range of the index-space is stored in a simple struct using the format { offset, global size, work-group size }
– Why did we skip arguments 2 and 3? These arguments are not static; they will be switched with a double-buffer scheme
Host Code Implementation (5/8)
    clmsync( stdgpu, 0, pos1, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
    clmsync( stdgpu, 0, vel,  CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
Copy particle positions and velocities to GPU
Notes:
– Device ID is "0" in this case - assuming we have one GPU
– In this case the flags indicate blocking calls
– clmsync will initiate OpenCL clEnqueueReadBuffer or clEnqueueWriteBuffer commands as needed
Host Code Implementation (6/8)
    for (step = 0; step < nstep; step += nburst) {

        for (burst = 0; burst < nburst; burst += 2) {

            clarg_set_global( krn, 2, pos1 );
            clarg_set_global( krn, 3, pos2 );
            clfork( stdgpu, 0, krn, &ndr, CL_EVENT_NOWAIT );

            clarg_set_global( krn, 2, pos2 );
            clarg_set_global( krn, 3, pos1 );
            clfork( stdgpu, 0, krn, &ndr, CL_EVENT_NOWAIT );

        }
Kernel execution using double-buffer scheme
Notes:
– Arguments 2 and 3 must be set just prior to kernel execution since we are switching between two arrays
– clfork will initiate an OpenCL clEnqueueNDRangeKernel command
– In this case clfork calls are non-blocking
Host Code Implementation (7/8)
        clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_EVENT_RELEASE );

        clmsync( stdgpu, 0, pos1, CL_MEM_HOST|CL_EVENT_WAIT|CL_EVENT_RELEASE );

    }
Synchronization and read back of data
clwait blocks until the multiple kernel executions are completed
clmsync will copy particle positions back to the host
Both calls are specific to device ID "0", i.e., the GPU
Host Code Implementation (8/8)
    nbody_output( nparticle, pos1, vel );

    clclose( stdgpu, clh );

    clfree( pos1 );
    clfree( pos2 );
    clfree( vel );

}
Output results and clean up resources
Compilation
Compilation
### Makefile for N-Body program
NAME = nbody
OBJS = nbody_init.o nbody_output.o
OPENCL = /usr/local/atistream
STDCL = /usr/local/browndeer
INCS += -I$(OPENCL)/include -I$(STDCL)/include
LIBS += -L$(OPENCL)/lib/x86_64 -lOpenCL -lpthread -ldl \
	-L$(STDCL)/lib -lstdcl
CFLAGS += -O3

all: $(NAME).x

$(NAME).x: $(NAME).o $(OBJS)
	$(CC) $(CFLAGS) $(INCS) -o $(NAME).x $(NAME).o $(OBJS) $(LIBS)

.SUFFIXES:
.SUFFIXES: .c .o

.c.o:
	$(CC) $(CFLAGS) $(INCS) -c $<
Makefile shows required includes and links
Modify to suit your local setup
Using Multiple OpenCL Devices
Example using two GPUs
Divide the work for the force calculation and particle position/velocity update across two devices
Synchronization and memory management must be handled explicitly, and with some care
[Diagram: particle data 0, 1, 2, ..., N/2−1 assigned to GPU 0; N/2, ..., N−1 assigned to GPU 1]
Implementation for Two GPUs: Kernel Code
Two-Device Kernel Code (1/6)
// nbody_kern.cl (two devices)
__kernel void nbody_kern(
    float dt1, float eps,
    __global float4* pos_old,
    __global float4* pos_new,
    __global float4* vel,
    __local float4* pblock,
    __global float4* pos_remote
) {
Prototype for kernel
Note: an additional pointer is added to point to the second "remote" array of particles; these are needed to calculate the total force on the particles being updated
Two-Device Kernel Code (2/6)
const float4 dt = (float4)( dt1, dt1, dt1, 0.0f );

int gti = get_global_id(0);
int ti  = get_local_id(0);

int n  = get_global_size(0);
int nt = get_local_size(0);
int nb = n/nt;
Size/index determination and other bookkeeping
No change from previous kernel
Two-Device Kernel Code (3/6)
float4 p = pos_old[gti];
float4 v = vel[gti];

float4 a = (float4)( 0.0f, 0.0f, 0.0f, 0.0f );

// For each block ...
for (int jb = 0; jb < nb; jb++) {

    // Cache ONE local particle position
    pblock[ti] = pos_old[jb * nt + ti];

    // Wait for others in work-group
    barrier(CLK_LOCAL_MEM_FENCE);
Loop over blocks with caching of particle positions
No change from the previous kernel except that we now distinguish local and remote particles
Two-Device Kernel Code (4/6)
    // For each cached local particle position ...
    for (int j = 0; j < nt; j++) {

        // Accumulate acceleration
        float4 p2 = pblock[j];
        float4 d = p2 - p;
        float invr = rsqrt( d.x * d.x + d.y * d.y + d.z * d.z + eps );
        float f = p2.w * invr * invr * invr;
        a += f * d;
    }

    // Wait for others in work-group
    barrier(CLK_LOCAL_MEM_FENCE);
Perform force calculation using blocking
Again, no change from the previous kernel except for the distinction between local and remote particles
Two-Device Kernel Code (5/6)
    // Cache one remote particle position
    pblock[ti] = pos_remote[jb * nt + ti];

    // Wait for others in the work-group
    barrier(CLK_LOCAL_MEM_FENCE);

    // For each cached remote particle position
    for (int j = 0; j < nt; j++) {

        // Accumulate acceleration
        float4 p2 = pblock[j];
        float4 d = p2 - p;
        float invr = rsqrt(d.x * d.x + d.y * d.y + d.z * d.z + eps);
        float f = p2.w * invr * invr * invr;
        a += f * d;
    }

    // Wait for others in work-group
    barrier(CLK_LOCAL_MEM_FENCE);
}
This code repeats the above code for remote particles
Only difference is global array from which positions are read
Two-Device Kernel Code (6/6)
p += dt * v + 0.5f * dt * dt * a;
v += dt * a;

pos_new[gti] = p;
vel[gti] = v;
}
Position and velocity update
No change from previous kernel
Two-GPU Implementation: Host Code
Two-Device Host Code (1/9)
// nbody2.c (two GPUs)
#include <stdcl.h>

void nbody_init( int n, cl_float4* pos, cl_float4* vel );
void nbody_output( int n, cl_float4* pos, cl_float4* vel );

int main(int argc, char** argv) {

    int step, burst;

    int nparticle = 8192;  // Must be power of two for simplicity
    int nstep = 100;
    int nburst = 20;       // Must divide nstep without remainder
    int nthread = 64;      // chosen for ATI Radeon HD 5870

    float dt = 0.0001f;
    float eps = 0.0001f;
Initialization of program - no change
Two-Device Host Code (2/9)
    size_t nparticle_sz = nparticle * sizeof(cl_float4);

    cl_float4* pos1  = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0 );

    cl_float4* pos1a = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0 );
    cl_float4* pos1b = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0 );

    cl_float4* pos2a = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0 );
    cl_float4* pos2b = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0 );

    cl_float4* vel   = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0 );

    cl_float4* vela  = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0 );
    cl_float4* velb  = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0 );
Memory allocation
We need global storage for N particles as well as half-sized storage for partitioning particle data across two devices
The designations "a" and "b" will correspond to the two GPUs
Two-Device Host Code (3/9)
    nbody_init( nparticle, pos1, vel );

    memcpy( pos1a, pos1, nparticle/2 * sizeof(cl_float4) );
    memcpy( pos1b, pos1 + nparticle/2, nparticle/2 * sizeof(cl_float4) );
    memcpy( vela, vel, nparticle/2 * sizeof(cl_float4) );
    memcpy( velb, vel + nparticle/2, nparticle/2 * sizeof(cl_float4) );

    void* clh = clopen( stdgpu, "nbody_kern.cl", CLLD_NOW );
    cl_kernel krn = clsym( stdgpu, clh, "nbody_kern", CLLD_NOW );
Particle positions and velocities are initialized as before
The memcpy calls partition the data into half-sized arrays to distribute the particles across the two GPUs
Load and compile OpenCL kernels - no change
Two-Device Host Code (4/9)
    clndrange_t ndr2 = clndrange_init1d( 0, nparticle/2, nthread );

    clarg_set( krn, 0, dt );
    clarg_set( krn, 1, eps );
    clarg_set_local( krn, 5, nthread * sizeof(cl_float4) );
Set up computational domain and kernel arguments
Only a small change is made to account for the computational domain per device being reduced by a factor of 2 since the work is being distributed across two devices
Two-Device Host Code (5/9)
    clmsync( stdgpu, 0, pos1a, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
    clmsync( stdgpu, 0, pos1b, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
    clmsync( stdgpu, 0, vela,  CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );

    clmsync( stdgpu, 1, pos1a, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
    clmsync( stdgpu, 1, pos1b, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
    clmsync( stdgpu, 1, velb,  CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
Copy particle positions and velocities to the two GPUs
The device ID is used to indicate which GPU is involved in the copy, either "0" or "1"
Two-Device Host Code (6/9)
    for (step = 0; step < nstep; step += nburst) {

        for (burst = 0; burst < nburst; burst += 2) {

            clarg_set_global( krn, 2, pos1a );
            clarg_set_global( krn, 3, pos2a );
            clarg_set_global( krn, 4, vela );
            clarg_set_global( krn, 6, pos1b );
            clfork( stdgpu, 0, krn, &ndr2, CL_EVENT_NOWAIT );  // GPU 0

            clarg_set_global( krn, 2, pos1b );
            clarg_set_global( krn, 3, pos2b );
            clarg_set_global( krn, 4, velb );
            clarg_set_global( krn, 6, pos1a );
            clfork( stdgpu, 1, krn, &ndr2, CL_EVENT_NOWAIT );  // GPU 1
Kernel execution on both GPUs
Note that the clfork calls are non-blocking
Two-Device Host Code (7/9)
            clmsync( stdgpu, 0, pos2a, CL_MEM_HOST|CL_EVENT_NOWAIT );
            clmsync( stdgpu, 1, pos2b, CL_MEM_HOST|CL_EVENT_NOWAIT );

            clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );
            clwait( stdgpu, 1, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );

            clmsync( stdgpu, 0, pos2b, CL_MEM_DEVICE|CL_EVENT_NOWAIT );
            clmsync( stdgpu, 1, pos2a, CL_MEM_DEVICE|CL_EVENT_NOWAIT );
[Diagram: position arrays exchanged between the two GPU memories through host memory over PCIe]
Exchange particle position arrays between GPUs
The device ID is used to indicate which GPU is involved in each call, either "0" or "1"
Note the use of clwait for synchronization
Two-Device Host Code (8/9)
            clarg_set_global( krn, 2, pos2a );
            clarg_set_global( krn, 3, pos1a );
            clarg_set_global( krn, 4, vela );
            clarg_set_global( krn, 6, pos2b );
            clfork( stdgpu, 0, krn, &ndr2, CL_EVENT_NOWAIT );

            clarg_set_global( krn, 2, pos2b );
            clarg_set_global( krn, 3, pos1b );
            clarg_set_global( krn, 4, velb );
            clarg_set_global( krn, 6, pos2a );
            clfork( stdgpu, 1, krn, &ndr2, CL_EVENT_NOWAIT );

            clmsync( stdgpu, 0, pos1a, CL_MEM_HOST|CL_EVENT_NOWAIT );
            clmsync( stdgpu, 1, pos1b, CL_MEM_HOST|CL_EVENT_NOWAIT );

            clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );
            clwait( stdgpu, 1, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );

            clmsync( stdgpu, 0, pos1b, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );
            clmsync( stdgpu, 1, pos1a, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );

        }
Kernel execution and exchange - everything repeated for the double-buffer scheme
The bookkeeping is left as an exercise
Two-Device Host Code (9/9)
    memcpy( pos1, pos1a, nparticle/2 * sizeof(cl_float4) );
    memcpy( pos1 + nparticle/2, pos1b, nparticle/2 * sizeof(cl_float4) );

    nbody_output( nparticle, pos1, vel );

    clclose( stdgpu, clh );

    clfree( pos1 );
    clfree( pos1a );
    clfree( pos1b );
    clfree( pos2a );
    clfree( pos2b );
    clfree( vel );
    clfree( vela );
    clfree( velb );

}
Output results and clean-up resources
Note the additional step to copy the half-sized arrays into the original array for all particles
Resources
OpenCL™ standard
– http://www.khronos.org/opencl
ATI Stream SDK
– http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx
– Note: tutorial code tested against version 2.1
STDCL
– Download:
http://www.browndeertechnology.com/stdcl.html#download OR
https://github.com/browndeer/coprthr/tree/stable-1.0
– Documentation:
http://www.browndeertechnology.com/docs/stdcl-manual.html
Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2009 Advanced Micro Devices, Inc. All rights reserved.