hc-4021, efficient scheduling of openmp and opencl™ workloads on accelerated processing units, by...

Efficient Scheduling of OpenMP and OpenCL Workloads: Getting the most out of your APU


Objective

! software has a long life-span that exceeds the life-span of hardware

! software is very expensive to write and maintain

! next generation hardware also needs to run legacy software

! Example: IWAVE

! procedural C-code

! no object orientation

! tight integration between data structures and functions

! What do I mean by efficient scheduling?

! find ways to utilize GPU cores for code blocks

! find ways to utilize all CPU cores and GPU units at the same time


Historical Context: GPU Compute Timeline

[Figure: timeline of GPU compute technologies, 2002 to 2012; labels include CUDA, Aparapi and C++ AMP.]


Accelerator Challenges: Technology Accessibility and Performance

[Figure: chart with axes Ease-of-Use and Performance, positioning CPU Single Thread, CPU Multithread, and OpenCL & CUDA.]


APU Opportunities: One Die - Two Computational Devices

| Metric | CPU | GPU |
|---|---|---|
| Memory Size | large | small |
| Memory Bandwidth | small | large |
| Parallelism | small | large |
| General Purpose | yes | no |
| Performance | application dependent | application dependent |
| Performance-per-Watt | application dependent | application dependent |
| Programming | Traditional | OpenCL |


APU Opportunities: Performance and Performance-per-Watt

| Metric | CPU | GPU | APU |
|---|---|---|---|
| Performance [Pts] | 170 | 197 | 316 |
| Power [W] | 50 | 37 | 58 |
| PPW [Pts/W] | 3.4 | 5.3 | 5.4 |
| Combined [Pts²/W] | 578 | 1049 | 1722 |

Test system: Luxmark OpenCL Benchmark; Ubuntu 12.10 x86_64; 4 Piledriver CPU cores @ 2.5 GHz; 6 GPU Compute Units @ 720 MHz; 16 GB DDR3 1600 MHz

! Example: Luxmark OpenCL Benchmark

! Similar CPU and GPU performance

! Best performance by using the APU

! GPU and APU have nearly the same performance-per-Watt (5.3 vs. 5.4 Pts/W)

! APU provides outstanding value
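The derived rows of the Luxmark table can be reproduced from the measured performance and power values. The sketch below assumes that PPW is performance divided by power and that the "Combined" metric is performance multiplied by PPW; this reading is inferred from the numbers and is not stated on the slide. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Performance-per-Watt in Pts/W. */
static double perf_per_watt(double points, double watts) {
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric in Pts^2/W, assumed to be performance times PPW. */
static double combined_metric(double points, double watts) {
    return points * perf_per_watt(points, watts);
}

/* Print the derived values for one device. */
static void print_row(const char *device, double points, double watts) {
    printf("%-4s PPW = %.1f Pts/W, combined = %.0f Pts^2/W\n",
           device, perf_per_watt(points, watts),
           combined_metric(points, watts));
}
```

Calling `print_row("CPU", 170, 50)`, `print_row("GPU", 197, 37)` and `print_row("APU", 316, 58)` recomputes the last two table rows from the first two.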


Example: Luxmark Renderer (Performance and Performance-per-Watt)

[Figure: Luxmark render results with chart annotations +81% and +64%.]

Test system: Luxmark OpenCL Benchmark, render of the "Sala" scene; Ubuntu 12.10 x86_64; 4 Piledriver cores @ 2.5 GHz; 6 GPU CUs @ 720 MHz; 16 GB DDR3 1600 MHz


Programming Strategies: Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! Know the problem you are trying to solve.

! staggered rectangular grid in 3D

! coupled first order PDE

! scalar pressure field p

! vector velocity field v = {vx, vy, vz}

! source term g
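For orientation, one common way to write such a coupled first-order system is sketched below. It is not taken from the slides: the bulk modulus κ and the density ρ are assumed symbols, and the exact placement and scaling of the source term g in IWAVE may differ.

$$
\frac{\partial p}{\partial t} = -\kappa\,\nabla\cdot\mathbf{v} + g,
\qquad
\rho\,\frac{\partial \mathbf{v}}{\partial t} = -\nabla p,
\qquad
\mathbf{v} = (v_x, v_y, v_z)
$$

In this form each velocity component is updated from p alone, while p is updated from all three velocity components.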



[Figure: execution timeline; p, vx, vy and vz are all computed with OpenMP.]

while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field
    sgn_ts3d_210_v0_OpenMP(dom, pars);      // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
    …
}



! Measure the initial performance.

! pressure and velocity field simulated using OpenMP

! average time T[ms] per iteration

! OpenMP performance scales linearly with the number of threads
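A minimal sketch of how the average time T[ms] per iteration could be measured is shown below. It is an illustration, not the measurement code behind the slides: `step_fn`, `average_step_ms` and `counting_step` are hypothetical names, and the C11 `timespec_get` wall clock is an assumed choice of timer (it is not monotonic).

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type standing in for one simulation time step. */
typedef void (*step_fn)(void *state);

/* Wall-clock time in milliseconds, using C11 timespec_get. */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Run `iterations` steps and return the average time per step in ms. */
static double average_step_ms(step_fn step, void *state, int iterations) {
    assert(iterations > 0);
    double start = now_ms();
    for (int i = 0; i < iterations; ++i) {
        step(state);
    }
    return (now_ms() - start) / iterations;
}

/* Placeholder step for illustration only: counts how often it ran. */
static void counting_step(void *state) {
    ++*(int *)state;
}
```

In a real run, the placeholder step would be replaced by a wrapper around the four kernel calls of the main simulation loop, and the measurement repeated for each thread count.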



[Figure: execution timeline (p, vx, vy and vz computed with OpenMP) next to a causality diagram of the same four blocks.]

! find computational blocks

! understand dependencies between blocks

! identify sequential and parallel parts



while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field p
    sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis using OpenCL
    sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
    …
}

[Figure: execution timeline; vx is computed with OpenCL while the OpenMP threads are IDLE; p, vy and vz are computed with OpenMP.]



! use the GPU to compute vx

! the CPU is idle while the GPU is running

! 42% improvement for 1 thread

! 25% improvement for 2 threads




[Figure: execution timeline; vx is computed with OpenCL, and p, vy and vz are computed with OpenMP, with no idle block.]

while(…) {                                                  // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);                    // calculate pressure field p

    int num_threads = atoi(getenv("OMP_NUM_THREADS"));      // save the current number of OpenMP threads
                                                            // (assumes OMP_NUM_THREADS is set)
    omp_set_num_threads(2);                                 // restrict the number of OpenMP threads to 2
    omp_set_nested(1);                                      // allow nested OpenMP threads

    #pragma omp parallel shared(…) private(…)               // start 2 OpenMP threads
    {
        switch ( omp_get_thread_num() ) {
        case 0:
            sgn_ts3d_210_v0_OpenCL(dom, pars);              // calculate velocity x-axis using OpenCL
            break;
        case 1:
            omp_set_num_threads(num_threads);               // increase number of OpenMP threads back
            sgn_ts3d_210_v1_OpenMP(dom, pars);              // calculate velocity y-axis
            sgn_ts3d_210_v2_OpenMP(dom, pars);              // calculate velocity z-axis
            break;
        default:
            break;
        }
    }                                                       // close OpenMP pragma
}                                                           // close simulation while



! overlap vx and vy

! CPU not idle anymore

! 50% improvement for 1 thread
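The same overlap can also be expressed with OpenMP sections instead of a switch on the thread number. The sketch below is an alternative formulation, not the code from the slides; `step_state` and the three `compute_*` functions are hypothetical stand-ins for the OpenCL vx kernel and the OpenMP vy and vz kernels. Compiled without OpenMP support, the pragmas are ignored and the two sections simply run one after the other.

```c
#include <stdio.h>

/* Hypothetical state: records which velocity components were computed. */
typedef struct {
    int vx_done;
    int vy_done;
    int vz_done;
} step_state;

/* Stand-ins for the real kernels; each only marks its component as done. */
static void compute_vx_on_gpu(step_state *s) { s->vx_done = 1; }
static void compute_vy_on_cpu(step_state *s) { s->vy_done = 1; }
static void compute_vz_on_cpu(step_state *s) { s->vz_done = 1; }

/* One velocity update: vx on the GPU overlaps with vy and vz on the CPU. */
static void velocity_step(step_state *s) {
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            compute_vx_on_gpu(s);   /* thread A drives the GPU */
        }
        #pragma omp section
        {
            compute_vy_on_cpu(s);   /* thread B keeps the CPU busy */
            compute_vz_on_cpu(s);
        }
    }   /* implicit barrier: all three components are ready here */
}
```

In real code the CPU section would open its own nested parallel region, so nested parallelism still has to be enabled as in the original listing.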





[Figure: execution timeline; p, vx, vy and vz are all computed with OpenCL.]

while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenCL(dom, pars);    // calculate pressure field
    sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenCL(dom, pars);      // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenCL(dom, pars);      // calculate velocity z-axis
    …
}

bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                 // copy data from host to device
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel on device
    clEnqueueReadBuffer(queue, buffer, …);                  // copy data from device to host
    …
}



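[Figure: execution timeline; p, vy and vz are computed with OpenMP, and the OpenCL vx block is broken down into upload, kernel execution and download.]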

! understand where performance gets lost

! 98% of time spent on I/O

! 2% of time spent on compute

! reduce I/O

Time breakdown of the OpenCL vx step:

| OpenCL Upload | Kernel Execution | OpenCL Download |
|---|---|---|
| 188 ms | 4 ms | 54 ms |
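The 98% and 2% figures follow directly from the three measured times. A small sketch of the arithmetic, with the hypothetical helper name `io_fraction`:

```c
#include <assert.h>

/* Fraction of the total OpenCL time spent on host/device transfers. */
static double io_fraction(double upload_ms, double kernel_ms,
                          double download_ms) {
    double total_ms = upload_ms + kernel_ms + download_ms;
    assert(total_ms > 0.0);
    return (upload_ms + download_ms) / total_ms;
}
```

With the measured values, `io_fraction(188.0, 4.0, 54.0)` is 242/246, or roughly 0.98, which leaves about 2% for the kernel itself.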


Programming Strategies: Example: High Throughput Computer Vision with OpenCV

! How does the speedup of an OpenCL application (S_OpenCL) depend on the speedup of the OpenCL kernel (S_Kernel) when the OpenCL I/O time is fixed?

! Fraction of OpenCL I/O time: F_I/O

! 50% I/O time limits the maximum possible speedup to 2

! Minimize OpenCL I/O, only then increase OpenCL kernel performance

$$
S_{\mathrm{OpenCL}} = \frac{S_{\mathrm{Kernel}}}{\left(S_{\mathrm{Kernel}} - 1\right) F_{\mathrm{I/O}} + 1}
$$
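The relation above, S_OpenCL = S_Kernel / ((S_Kernel - 1) F_I/O + 1), can be evaluated directly to see how quickly the I/O fraction caps the overall gain. The sketch below uses the hypothetical function name `opencl_speedup`; the formula is the one from the slide.

```c
#include <assert.h>

/* Overall OpenCL speedup for a given kernel speedup and a fixed
 * I/O fraction (0 = no I/O, 1 = only I/O). */
static double opencl_speedup(double kernel_speedup, double io_fraction) {
    assert(kernel_speedup > 0.0);
    assert(io_fraction >= 0.0 && io_fraction <= 1.0);
    return kernel_speedup / ((kernel_speedup - 1.0) * io_fraction + 1.0);
}
```

For an I/O fraction of 0.5 the result approaches 2 however fast the kernel becomes, and for an I/O fraction of 0.98 a tenfold faster kernel gives an overall speedup of less than 2%.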



while(…) {                                  // main simulation loop
    sgn_ts3d_210_ALL_OpenCL(dom, pars);     // combine all OpenCL calculations
    …
}

bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                     // copy data from host to device

    while(…) {
        clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel for pressure
        clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);      // execute OpenCL kernel for velocity x
        clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);      // execute OpenCL kernel for velocity y
        clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);      // execute OpenCL kernel for velocity z
    }

    clEnqueueReadBuffer(queue, buffer, …);                      // copy data from device to host
    …
}

[Figure: execution timeline; p, vx, vy and vz are all computed with OpenCL.]



! eliminate all but essential I/O

! significant speedup over simple OpenCL



! measure real application performance

! 3000 iterations using a 97x405x389 simulation grid

! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; axis range 0 to 14.]



Test system: OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5 GHz; 6 GPU Compute Units @ 720 MHz; 16 GB DDR3 1600 MHz

! initial OpenCL performance measurements

! 89 algorithms tested at an image size of 4 MP

! compare OpenCL I/O and execution time

! 28% of all algorithms are compute bound

! 72% of all algorithms are I/O bound



! compare OpenCL and single-threaded performance

! 89 algorithms tested at an image size of 4 MP

! realistic timing that includes I/O over PCIe

! 59% of all algorithms execute faster on the GPU

! 41% of all algorithms execute faster on the CPU(1)

! significant speedup for only 15% of all algorithms




! Task: Batch process a large amount of images using a single algorithm.

! OpenCL performance is algorithm and image size dependent

! Either the CPU will process data or the GPU, but not both

! How to choose which algorithm and device to use depending on image size?



! Better: create an input image queue that the CPU and the GPU both query for new image tasks until the queue is empty.

! all CPU cores are fully utilized at all times even for single-threaded algorithms

! all GPU compute units are fully utilized at all times

! combined performance for single algorithm is sum of GPU and CPU performance for that algorithm

! combined performance for multiple algorithms is better than sum of device performance
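The shared-queue idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the slides: the queue is reduced to a shared index into the batch, `process_on_cpu` and `process_on_gpu` are hypothetical stand-ins for the OpenCV CPU and OpenCL code paths, and OpenMP sections are used to run one worker per device. Without OpenMP support the pragmas are ignored and the first worker drains the whole queue.

```c
#include <stddef.h>

/* Minimal shared work queue: a batch of `count` images, handed out by index. */
typedef struct {
    size_t next;    /* index of the next unprocessed image */
    size_t count;   /* total number of images in the batch */
} image_queue;

/* Hypothetical per-device worker: processes one image and records a result. */
typedef void (*process_fn)(size_t image_index, int *results);

/* Stand-ins for the real CPU and GPU code paths: tag who processed what. */
static void process_on_cpu(size_t i, int *results) { results[i] = 1; }
static void process_on_gpu(size_t i, int *results) { results[i] = 2; }

/* Atomically take the next image index; returns 0 once the queue is empty. */
static int queue_pop(image_queue *q, size_t *image_index) {
    size_t i;
    #pragma omp atomic capture
    i = q->next++;
    if (i >= q->count) {
        return 0;
    }
    *image_index = i;
    return 1;
}

/* Keep pulling images until the queue is empty; returns how many were done. */
static size_t drain(image_queue *q, process_fn process, int *results) {
    size_t done = 0;
    size_t i;
    while (queue_pop(q, &i)) {
        process(i, results);
        ++done;
    }
    return done;
}

/* Let a CPU worker and a GPU worker drain the same queue concurrently. */
static void run_batch(image_queue *q, int *results,
                      size_t *cpu_done, size_t *gpu_done) {
    size_t ncpu = 0;
    size_t ngpu = 0;
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        { ncpu = drain(q, process_on_cpu, results); }
        #pragma omp section
        { ngpu = drain(q, process_on_gpu, results); }
    }
    *cpu_done = ncpu;
    *gpu_done = ngpu;
}
```

Because each worker takes a new image as soon as it finishes the previous one, the faster device automatically processes more images, and no per-algorithm or per-size decision has to be made in advance.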

$$
P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}
$$

$$
P_i^{\mathrm{APU}} = P_i^{\mathrm{CPU}} + P_i^{\mathrm{GPU}}
$$
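A short numerical sketch illustrates why the combined performance across several algorithms can exceed the sum of the device performances. It assumes that P_i is the throughput of algorithm i, that the N algorithms are applied one after another, and that the overall throughput is therefore 1 / Σ (1 / P_i). The function names and the sample throughputs in the note below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Overall throughput of n algorithms applied in sequence on one device. */
static double sequential_throughput(const double *p, size_t n) {
    double time_per_item = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(p[i] > 0.0);
        time_per_item += 1.0 / p[i];
    }
    return 1.0 / time_per_item;
}

/* Same quantity when CPU and GPU share each algorithm through a queue,
 * so that the per-algorithm throughputs add up. */
static double apu_throughput(const double *cpu, const double *gpu, size_t n) {
    double time_per_item = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(cpu[i] + gpu[i] > 0.0);
        time_per_item += 1.0 / (cpu[i] + gpu[i]);
    }
    return 1.0 / time_per_item;
}
```

With made-up throughputs of {1, 3} on the CPU and {3, 1} on the GPU, each device alone reaches about 0.75, so the two together would reach 1.5 if each processed its own share of the batch, whereas sharing every algorithm through the queue gives 2.0.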


Programming Strategies: Summary

! next generation hardware and legacy code require compromises

! OpenCL performance follows Amdahl's Law: fixed OpenCL I/O time limits the gain from faster OpenCL execution

! application performance can be increased by overlapping OpenCL and OpenMP workloads

! removing all but the necessary OpenCL I/O can have a dramatic influence on performance

! for loosely coupled high-throughput applications, the OpenCL and OpenMP performance add up for a single algorithm

! for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances

! APUs may provide greatest performance per Watt

! GPUs may provide greatest performance


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
