Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

Page 1: HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

Efficient Scheduling of OpenMP and OpenCL Workloads
Getting the most out of your APU

| OpenCL and OpenMP Workloads on Accelerated Processing Units |!2

Objective

- Software has a long life span that exceeds the life span of hardware.
- Software is very expensive to write and maintain.
- Next-generation hardware also needs to run legacy software.
- Example: IWAVE
  - procedural C code
  - no object orientation
  - tight integration between data structures and functions
- What do I mean by efficient scheduling?
  - find ways to utilize GPU cores for code blocks
  - find ways to utilize all CPU cores and GPU units at the same time

Historical Context
GPU Compute Timeline

[Figure: timeline of GPU compute technologies with year marks 2002, 2008, 2010, and 2012; surviving labels: CUDA, Aparapi, C++ AMP]

Accelerator Challenges
Technology Accessibility and Performance

[Figure: ease of use plotted against performance for CPU single thread, CPU multithread, and OpenCL & CUDA]

APU Opportunities
One Die - Two Computational Devices

Metric                 CPU                     APU
Memory Size            large                   small
Memory Bandwidth       small                   large
Parallelism            small                   large
General Purpose        yes                     no
Performance            application dependent   application dependent
Performance-per-Watt   application dependent   application dependent
Programming            Traditional             OpenCL

APU Opportunities
Performance and Performance-per-Watt

Metric               CPU    GPU    APU
Performance [Pts]    170    197    316
Power [W]            50     37     58
PPW [Pts/W]          3.4    5.3    5.4
Combined [Pts²/W]    578    1049   1722

Test system: Luxmark OpenCL Benchmark; Ubuntu 12.10 x86_64; 4 Piledriver CPU cores @ 2.5 GHz; 6 GPU Compute Units @ 720 MHz; 16 GB DDR3 1600 MHz

- Example: Luxmark OpenCL Benchmark
  - Similar CPU and GPU performance
  - Best performance by using the APU
  - GPU has best performance-per-Watt
  - APU provides outstanding value
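The two derived rows of the table follow from the measured performance and power. As a minimal sketch (the function names `ppw` and `combined` are illustrative and do not come from the presentation), the arithmetic can be reproduced like this:

```c
#include <assert.h>

/* Performance per Watt: benchmark points divided by power draw. */
double ppw(double points, double watts)
{
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric of the table: performance times performance per Watt,
 * which gives points squared per Watt. */
double combined(double points, double watts)
{
    return points * ppw(points, watts);
}
```

For the CPU column, `ppw(170, 50)` gives 3.4 Pts/W and `combined(170, 50)` gives 578 Pts²/W, which matches the table.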

Example: Luxmark Renderer
Performance and Performance-per-Watt

[Figure: Luxmark results chart; the only annotations that survive are +81% and +64%]

Test system: Luxmark OpenCL Benchmark, render "Sala" scene; Ubuntu 12.10 x86_64; 4 Piledriver cores @ 2.5 GHz; 6 GPU CUs @ 720 MHz; 16 GB DDR3 1600 MHz

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

- Know the problem you are trying to solve.
  - staggered rectangular grid in 3D
  - coupled first-order PDE
  - scalar pressure field p
  - vector velocity field v = {vx, vy, vz}
  - source term g
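The equations shown on this slide did not survive in the transcript. As a hedged reminder, a commonly used first-order pressure-velocity form of the acoustic wave equation is given below. The bulk modulus κ and the density ρ are assumed symbols that do not appear in the transcript, and the exact form on the slide may differ.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa \, \nabla \cdot \mathbf{v} + g, \\
\rho \, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p .
\end{aligned}
```

In this form each velocity component is updated from the pressure gradient alone, so the updates of vx, vy, and vz do not depend on one another, while the pressure update needs all three components.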

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

[Figure: timeline of the OpenMP blocks p, vx, vy, and vz within one iteration]

    while (…) {                                // main simulation loop
        sgn_ts3d_210_p012_OpenMP(dom, pars);   // calculate pressure field
        sgn_ts3d_210_v0_OpenMP(dom, pars);     // calculate velocity x-axis
        sgn_ts3d_210_v1_OpenMP(dom, pars);     // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);     // calculate velocity z-axis
        …
    }
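The bodies of the update functions called in the loop are not shown in the presentation. As a rough idea of what such a block does, here is a hypothetical 1D analogue of a velocity update on a staggered grid, parallelized with an OpenMP loop. The function name, the 1D layout, and the parameter `coeff` are assumptions made for illustration and are not IWAVE code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1D analogue of one velocity update on a staggered grid:
 * v[i] sits between p[i] and p[i + 1], so v has n - 1 entries for n
 * pressure samples. `coeff` stands for dt / (rho * dx). */
void update_velocity_1d(double *v, const double *p, size_t n, double coeff)
{
    assert(n >= 1);
    /* Every iteration writes a different v[i] and only reads p, so the
     * loop can be shared among threads. Without OpenMP support the pragma
     * is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (long i = 0; i < (long)(n - 1); ++i) {
        v[i] -= coeff * (p[i + 1] - p[i]);
    }
}
```

Because the loop iterations are independent, this kind of block is a natural fit for both an OpenMP loop and an OpenCL kernel.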

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

- Measure the initial performance.
  - pressure and velocity fields simulated using OpenMP
  - average time T [ms] per iteration
  - OpenMP performance scales linearly with the number of threads

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

[Figure: the OpenMP blocks p, vx, vy, and vz arranged along a time axis and along a causality axis]

- find computational blocks
- understand dependencies between blocks
- identify sequential and parallel parts

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

    while (…) {                                // main simulation loop
        sgn_ts3d_210_p012_OpenMP(dom, pars);   // calculate pressure field p
        sgn_ts3d_210_v0_OpenCL(dom, pars);     // calculate velocity x-axis
        sgn_ts3d_210_v1_OpenMP(dom, pars);     // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);     // calculate velocity z-axis
        …
    }

[Figure: timeline in which the CPU is marked IDLE while OpenCL vx runs; p, vy, and vz run as OpenMP blocks]

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

- use the GPU to compute vx
- the CPU is idle while the GPU is running
- 42% improvement for 1 thread
- 25% improvement for 2 threads
- 9% improvement for 4 threads

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

[Figure: timeline showing OpenCL vx alongside the OpenMP blocks p, vy, and vz]

    while (…) {                                        // main simulation loop
        sgn_ts3d_210_p012_OpenMP(dom, pars);           // calculate pressure field p

        // save the current number of OpenMP threads (assumes OMP_NUM_THREADS is set)
        int num_threads = atoi(getenv("OMP_NUM_THREADS"));
        omp_set_num_threads(2);                        // restrict the number of OpenMP threads to 2
        omp_set_nested(1);                             // allow nested OpenMP threads

        #pragma omp parallel shared(…) private(…)      // start 2 OpenMP threads
        {
            switch (omp_get_thread_num()) {
            case 0:
                sgn_ts3d_210_v0_OpenCL(dom, pars);     // calculate velocity x-axis using OpenCL
                break;
            case 1:
                omp_set_num_threads(num_threads);      // increase number of OpenMP threads back
                sgn_ts3d_210_v1_OpenMP(dom, pars);     // calculate velocity y-axis
                sgn_ts3d_210_v2_OpenMP(dom, pars);     // calculate velocity z-axis
                break;
            default:
                break;
            }
        }                                              // close OpenMP pragma
    }                                                  // close simulation while

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

- overlap vx and vy
- CPU not idle anymore
- 50% improvement for 1 thread
- 40% improvement for 2 threads
- 38% improvement for 4 threads
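Why overlapping helps can be seen from an idealized timing model of the three schedules discussed so far. The sketch below is an assumption-laden simplification: it ignores threading overhead, memory contention, and the CPU thread that drives the GPU. The function names and any block times passed in are hypothetical, so it is not meant to reproduce the percentages reported above.

```c
#include <assert.h>

/* Per-iteration time when every block runs one after another on the CPU. */
double time_sequential(double t_p, double t_vx, double t_vy, double t_vz)
{
    return t_p + t_vx + t_vy + t_vz;
}

/* vx moved to the GPU while the CPU waits: the blocks still run back to
 * back, only the cost of the vx block changes. */
double time_offload_idle(double t_p, double t_vx_gpu, double t_vy, double t_vz)
{
    return t_p + t_vx_gpu + t_vy + t_vz;
}

/* vx on the GPU overlapped with vy and vz on the CPU: the slower of the
 * two concurrent branches decides when the iteration can continue. */
double time_overlapped(double t_p, double t_vx_gpu, double t_vy, double t_vz)
{
    double cpu_branch = t_vy + t_vz;
    assert(t_vx_gpu >= 0.0 && cpu_branch >= 0.0);
    return t_p + (t_vx_gpu > cpu_branch ? t_vx_gpu : cpu_branch);
}
```

With made-up block times of 4, 3, 3, and 3 ms on the CPU and 1 ms for vx on the GPU, the model gives 13 ms sequentially, 11 ms with the idle CPU, and 10 ms when overlapped. If the GPU branch took longer than vy plus vz, the GPU would become the limiting branch.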

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

[Figure: timeline of the OpenCL blocks p, vx, vy, and vz]

    while (…) {                                // main simulation loop
        sgn_ts3d_210_p012_OpenCL(dom, pars);   // calculate pressure field
        sgn_ts3d_210_v0_OpenCL(dom, pars);     // calculate velocity x-axis
        sgn_ts3d_210_v1_OpenCL(dom, pars);     // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenCL(dom, pars);     // calculate velocity z-axis
        …
    }

    bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
        …
        clEnqueueWriteBuffer(queue, buffer, …);                // copy data from host to device
        clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);   // execute OpenCL kernel on device
        clEnqueueReadBuffer(queue, buffer, …);                 // copy data from device to host
        …
    }

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

[Figure: timeline of the OpenMP blocks p, vy, and vz and the OpenCL vx block, with OpenCL vx broken down into upload, kernel execution, and download]

- understand where performance gets lost
  - 98% of time spent on I/O
  - 2% of time spent on compute
- reduce I/O

Time breakdown of the OpenCL vx block:

OpenCL Upload    Kernel Execution    OpenCL Download
188 ms           4 ms                54 ms
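The 98% and 2% figures follow from the three measured times. A minimal sketch of that arithmetic, with an illustrative function name that is not from the presentation:

```c
#include <assert.h>

/* Share of the total OpenCL call time spent on transfers
 * (upload plus download). */
double io_fraction(double upload_ms, double kernel_ms, double download_ms)
{
    double total = upload_ms + kernel_ms + download_ms;
    assert(total > 0.0);
    return (upload_ms + download_ms) / total;
}
```

`io_fraction(188, 4, 54)` evaluates to 242 / 246, roughly 0.984, which leaves about 1.6% of the time for the kernel itself.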

Programming Strategies
Example: High Throughput Computer Vision with OpenCV

- How does the speedup of an OpenCL application ($S_{\mathrm{OpenCL}}$) depend on the speedup of the OpenCL kernel ($S_{\mathrm{Kernel}}$) when the OpenCL I/O time is fixed?
- Fraction of OpenCL I/O time: $F_{\mathrm{I/O}}$
- An I/O time fraction of 50% limits the maximum possible speedup to 2.
- Minimize OpenCL I/O first; only then increase OpenCL kernel performance.

$$S_{\mathrm{OpenCL}} = \frac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\,F_{\mathrm{I/O}} + 1}$$
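Reading the slide's formula as S_OpenCL = S_Kernel / ((S_Kernel - 1) F_I/O + 1), the limit stated in the bullets can be checked numerically. The sketch below assumes that F_I/O is the I/O share of the run time before the kernel is sped up; the function name is illustrative.

```c
#include <assert.h>

/* Application speedup for a kernel speedup `s_kernel` when a fraction
 * `f_io` of the original run time is OpenCL I/O that does not get faster. */
double opencl_speedup(double s_kernel, double f_io)
{
    assert(s_kernel > 0.0 && f_io >= 0.0 && f_io <= 1.0);
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}
```

With `f_io = 0.5`, a kernel speedup of 3 yields an application speedup of 1.5, and a kernel speedup of 1000 yields about 1.998. The result approaches 2 without reaching it, which is the Amdahl-style bound the slide refers to.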

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

    while (…) {                                // main simulation loop
        sgn_ts3d_210_ALL_OpenCL(dom, pars);    // combine all OpenCL calculations
        …
    }

    bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
        …
        clEnqueueWriteBuffer(queue, buffer, …);                    // copy data from host to device

        while (…) {
            clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);   // execute OpenCL kernel for pressure
            clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);     // execute OpenCL kernel for velocity x
            clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);     // execute OpenCL kernel for velocity y
            clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);     // execute OpenCL kernel for velocity z
        }

        clEnqueueReadBuffer(queue, buffer, …);                     // copy data from device to host
        …
    }

[Figure: timeline of the OpenCL blocks p, vx, vy, and vz]

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

- eliminate all but essential I/O
- significant speedup over simple OpenCL
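The effect of keeping data on the device can be estimated with a simple cost model. This is a sketch under the assumption that upload, kernel, and download times are constant per call and that no intermediate results have to be read back; the function names and any numbers are hypothetical.

```c
#include <assert.h>

/* Total time when every iteration uploads, runs the kernels, and downloads. */
double time_io_every_iteration(long iterations, double upload_ms,
                               double kernels_ms, double download_ms)
{
    assert(iterations >= 0);
    return (double)iterations * (upload_ms + kernels_ms + download_ms);
}

/* Total time when data is uploaded once, stays on the device for all
 * iterations, and is downloaded once at the end. */
double time_io_once(long iterations, double upload_ms,
                    double kernels_ms, double download_ms)
{
    assert(iterations >= 0);
    return upload_ms + (double)iterations * kernels_ms + download_ms;
}
```

For 10 iterations with 5 ms upload, 1 ms of kernels, and 4 ms download, the first schedule costs 100 ms and the second 19 ms. The gap widens as the iteration count grows, because the transfer cost is paid only once.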

Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE

- measure real application performance
- 3000 iterations using a 97x405x389 simulation grid
- 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; axis ticks from 0 to 14]

Programming Strategies
Example: High Throughput Computer Vision with OpenCV

Test system: OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5 GHz; 6 GPU Compute Units @ 720 MHz; 16 GB DDR3 1600 MHz

- initial OpenCL performance measurements
- 89 algorithms tested for an image size of 4 MP
- compare OpenCL I/O and execution time
- 28% of all algorithms are compute bound
- 72% of all algorithms are I/O bound
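The transcript does not say how an algorithm was classified. One plausible reading, used in the sketch below as an explicit assumption, is that an algorithm counts as I/O bound when its OpenCL transfer time exceeds its kernel execution time. The type and function names are illustrative.

```c
#include <assert.h>

typedef enum { COMPUTE_BOUND, IO_BOUND } bound_t;

/* Classify one measurement: I/O bound when transfers take longer than
 * kernel execution. Treating a tie as compute bound is an assumption. */
bound_t classify(double io_ms, double exec_ms)
{
    assert(io_ms >= 0.0 && exec_ms >= 0.0);
    return io_ms > exec_ms ? IO_BOUND : COMPUTE_BOUND;
}

/* Fraction of the n measurements that are I/O bound. */
double io_bound_share(const double *io_ms, const double *exec_ms, int n)
{
    int count = 0;
    assert(n > 0);
    for (int i = 0; i < n; ++i) {
        if (classify(io_ms[i], exec_ms[i]) == IO_BOUND) {
            ++count;
        }
    }
    return (double)count / n;
}
```

Applied to the per-algorithm measurements, `io_bound_share` would produce a figure comparable to the 72% quoted above, provided the same criterion was used.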

Programming Strategies
Example: High Throughput Computer Vision with OpenCV

- compare OpenCL and single-threaded performance
- 89 algorithms tested for an image size of 4 MP
- realistic timing that includes I/O over PCIe
- 59% of all algorithms execute faster on the GPU
- 41% of all algorithms execute faster on the CPU(1)
- significant speedup for only 15% of all algorithms

Test system: OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5 GHz; 6 GPU Compute Units @ 720 MHz; 16 GB DDR3 1600 MHz

Programming Strategies
Example: High Throughput Computer Vision with OpenCV

- Task: batch process a large number of images using a single algorithm.
- OpenCL performance depends on the algorithm and on the image size.
- Either the CPU or the GPU will process the data, but not both.
- How to choose which algorithm and device to use depending on image size?

Programming Strategies
Example: High Throughput Computer Vision with OpenCV

- Better: create an input image queue that the CPU and GPU query for new image tasks until the queue is empty.
- All CPU cores are fully utilized at all times, even for single-threaded algorithms.
- All GPU compute units are fully utilized at all times.
- The combined performance for a single algorithm is the sum of the GPU and CPU performance for that algorithm.
- The combined performance for multiple algorithms is better than the sum of the device performances.
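The claim that the throughputs add for a single algorithm can be illustrated with a small simulation of the shared queue. This is a sketch, not the presenter's implementation: it assumes constant per-image processing times and no queueing overhead, and the names `queue_result` and `drain_queue` are made up for the example.

```c
#include <assert.h>

/* Result of draining a shared image queue with two devices. */
typedef struct {
    long cpu_images;   /* images processed by the CPU */
    long gpu_images;   /* images processed by the GPU */
    double makespan;   /* time at which the last image is finished */
} queue_result;

/* Event-style simulation: whichever device becomes free first takes the
 * next image from the queue. `cpu_time` and `gpu_time` are the assumed
 * constant per-image processing times of each device. */
queue_result drain_queue(long images, double cpu_time, double gpu_time)
{
    queue_result r = {0, 0, 0.0};
    double cpu_free = 0.0, gpu_free = 0.0;  /* time each device is next idle */

    assert(images >= 0 && cpu_time > 0.0 && gpu_time > 0.0);
    for (long i = 0; i < images; ++i) {
        if (cpu_free <= gpu_free) {
            cpu_free += cpu_time;
            ++r.cpu_images;
        } else {
            gpu_free += gpu_time;
            ++r.gpu_images;
        }
    }
    r.makespan = cpu_free > gpu_free ? cpu_free : gpu_free;
    return r;
}
```

With 300 images, 2 time units per image on the CPU and 1 on the GPU, the CPU ends up with 100 images, the GPU with 200, and the queue is empty after 200 time units. That is 1.5 images per time unit, the sum of the individual rates 0.5 and 1.0.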

$$P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$$

$$P_i^{\mathrm{APU}} = P_i^{\mathrm{CPU}} + P_i^{\mathrm{GPU}}$$
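Reading the first formula as P = 1 / Σ (1 / P_i) over the N algorithms, the multi-algorithm claim can be checked with a short calculation. The sketch assumes each P_i is a throughput such as images per second; the function names and the sample values are hypothetical.

```c
#include <assert.h>

/* Combined performance over n algorithms run one after another:
 * P = 1 / sum_i (1 / P_i). */
double combined_performance(const double *p, int n)
{
    double total_time = 0.0;  /* time to push one image through every algorithm */
    assert(n > 0);
    for (int i = 0; i < n; ++i) {
        assert(p[i] > 0.0);
        total_time += 1.0 / p[i];
    }
    return 1.0 / total_time;
}

/* Combined APU performance, using P_i(APU) = P_i(CPU) + P_i(GPU). */
double combined_apu_performance(const double *cpu, const double *gpu, int n)
{
    double total_time = 0.0;
    assert(n > 0);
    for (int i = 0; i < n; ++i) {
        assert(cpu[i] > 0.0 && gpu[i] > 0.0);
        total_time += 1.0 / (cpu[i] + gpu[i]);
    }
    return 1.0 / total_time;
}
```

Take two algorithms where the CPU achieves 4 and 1 and the GPU achieves 1 and 4. Each device alone reaches a combined performance of 0.8, so the sum of the devices is 1.6, while the APU reaches 2.5. The gain comes from each algorithm benefiting from whichever device suits it better.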

Programming Strategies
Summary

- Next-generation hardware and legacy code require compromises.
- OpenCL performance is tied to Amdahl's Law regarding OpenCL I/O and OpenCL execution time.
- Application performance can be increased by overlapping OpenCL and OpenMP workloads.
- Removing all but necessary OpenCL I/O can have a dramatic influence on performance.
- For loosely coupled high-throughput applications, the OpenCL and OpenMP performances add up for single algorithms.
- For multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances.
- APUs may provide the greatest performance per Watt.
- GPUs may provide the greatest performance.

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.