CUDA Programming and Performance Tuning with IBM XL Compilers on POWER8 Systems


Page 1:

Revolutionizing the Datacenter

Join the Conversation #OpenPOWERSummit

CUDA  Programming  and  Performance  Tuning  with  IBM  XL  Compilers  on  POWER8  Systems  

Shereen  Ghobrial,  Yaoqing  Gao  IBM  Canada  Lab  

 


Istvan  Reguly  PPCU  ITK,  Hungary  

Page 2:

Agenda  

- IBM XL Compiler Overview
- Performance Tuning with XL Compilers
- Performance results
- Summary


Page 3:

Overview  of  XL  C/C++  and  Fortran  Compilers  


[Diagram: XL C++ and XL Fortran platform coverage across Power (Linux BE and LE, AIX, BG/Q) and z Systems (Linux, z/OS).]

Page 4:

Advanced Optimization Technology
• Full platform exploitation: enable and exploit POWER hardware features
• Loop transformation: analyze and transform loops to improve performance
• Automatic SIMDization/vectorization: convert operations so that several calculations occur simultaneously
• Parallelization: automatic parallelization and explicit parallelization through OpenMP (see the sketch after this list)
• Optimized math libraries: scalar MASS library and vector MASSV library tuned for POWER
• IPA (inter-procedural analysis): apply optimization techniques to entire programs
• PDF (profile-directed feedback): tune application performance for typical usage scenarios

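A minimal sketch of the kind of loop these features target (illustrative only; the file name and the exact flag combination in the comment are assumptions composed from the options named on this slide):

/* Illustrative sketch: a loop the XL optimizer can SIMDize and, with
 * -qsmp=omp, also parallelize across cores. Assumed build command:
 *   xlc -O3 -qhot -qsmp=omp saxpy.c -o saxpy
 */
#include <stdio.h>

#define N 1048576

static float x[N], y[N];

/* y = a*x + y: unit stride and no loop-carried dependence, so the compiler
 * can convert the body to vector operations (SIMDization) and split the
 * iteration space across OpenMP threads. */
void saxpy(float a)
{
#pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}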

Page 5:

Language standard compliance
- C99 compliance; C11 compliance on Linux on Power, and selected features on other platforms
- C++98 compliance; C++11 compliance on Linux on Power, and selected features on other platforms
- Fortran 2003 compliance; selected Fortran 2008 and TS 29113 features
- OpenMP 3.1 compliance; selected OpenMP 4.0 features


Page 6:

Open  Source  &  GCC  affinity  

- Binary compatibility (see the sketch below)

[Diagram: XL C++ and gcc C++ objects calling each other; Makefile reuse; migration path from gcc to XL.]

- Options and source compatibility
- IBM Advance Toolchain support
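A hedged illustration of the binary-compatibility point, not taken from the deck: the file names, the command lines, and the sum() helper are assumptions; the idea is simply that an object built by g++ and an object built by xlC follow the same ABI on Linux on Power, so they link and call each other.

/* Sketch of GCC/XL binary compatibility. Assumed build:
 *   g++ -c util.cpp            (GCC-built object)
 *   xlC -O3 -c main.cpp        (XL-built object)
 *   xlC main.o util.o -o app   (link the mix)
 * The whole listing also compiles as a single file, for convenience.
 */

/* --- util.cpp (compiled with g++) --- */
#include <vector>
#include <numeric>
double sum(const std::vector<double> &v)
{
    return std::accumulate(v.begin(), v.end(), 0.0);
}

/* --- main.cpp (compiled with xlC) --- */
#include <cstdio>
#include <vector>
double sum(const std::vector<double> &v);   /* defined in the g++-built object */
int main()
{
    std::vector<double> v(100, 0.5);
    std::printf("sum = %f\n", sum(v));      /* expect 50.0 */
    return 0;
}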

Page 7:

Performance  Tuning  with  XL  compilers  and  Libraries  

- Identify application hot spots and performance bottlenecks
  • Performance tools: perf, OProfile, gprof, etc.
- Use the XL compiler report to examine compiler optimization:
  -qlistfmt=[xml | html]=inlines    generates inlining information
  -qlistfmt=[xml | html]=transform  generates loop transformation information
  -qlistfmt=[xml | html]=data       generates data reorganization information
  -qlistfmt=[xml | html]=pdf        generates dynamic profiling information
  -qlistfmt=[xml | html]=all        turns on all optimization content
  -qlistfmt=[xml | html]=none       turns off all optimization content
- Tune the performance with XL compilers (example after this list):
  • Typically start from -O2 or -O3
  • Add high-order optimization -qhot for floating-point computation and memory-bandwidth-intensive workloads
  • Add whole-program optimization -qipa[=level=0 | 1 | 2] for workloads with many small C/C++ function calls
  • Add -qsmp=omp for OpenMP workloads
  • Use CUDA C/C++ and Fortran for GPU-enabled workloads

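A hedged sketch of how that workflow might look on a toy hot spot; the file name, the matvec/fma1 helpers, and the command sequence in the comment are illustrative assumptions composed only from the options listed above.

/* Illustrative tuning target. Assumed progression:
 *   step 1: xlc -O3 hot.c -o hot
 *   step 2: xlc -O3 -qhot hot.c -o hot               (loop/FP-intensive code)
 *   step 3: xlc -O3 -qhot -qipa=level=2 hot.c -o hot (many small calls)
 *   report: add -qlistfmt=html=all to any step to inspect what was
 *           inlined and which loops were transformed.
 */
#include <stdio.h>
#define N 1024

static double a[N][N], x[N], y[N];

/* Small function called N*N times: a candidate for inlining (reported by
 * -qlistfmt=...=inlines) and, once inlined, for loop transformation
 * (reported by -qlistfmt=...=transform). */
static double fma1(double p, double q, double r) { return p * q + r; }

void matvec(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] = fma1(a[i][j], x[j], y[i]);
}

int main(void)
{
    for (int i = 0; i < N; i++) {
        x[i] = 1.0; y[i] = 0.0;
        for (int j = 0; j < N; j++) a[i][j] = 1.0;
    }
    matvec();
    printf("y[0] = %f\n", y[0]);   /* expect 1024.0 */
    return 0;
}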

Page 8:

Performance  Tuning  with  XL  compilers  and  Libraries  

- Consider using the highly tuned MASS/MASSV and ESSL libraries
- Aggressive optimization may affect the results of the program (see the sketch below). -qstrict guarantees results identical to noopt, at the expense of optimization; suboptions allow fine-grained control over this guarantee. Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict general and computation of NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations

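As an illustration (not from the slides), the toy reduction below is the kind of code whose result can change when the optimizer re-associates floating-point operations; -qstrict, or just -qstrict=order, preserves the original evaluation order.

/* Illustrative only: summing values of very different magnitude.
 * At -O3 the optimizer may re-associate or vectorize this reduction with
 * partial sums, changing rounding; -qstrict (or -qstrict=order) keeps the
 * original left-to-right evaluation order.
 */
#include <stdio.h>

int main(void)
{
    float big = 1.0e8f, small = 1.0f;
    float s = 0.0f;
    /* Left to right: each 'small' is absorbed into 'big' and lost, so the
     * final value is 0. A re-associated version that sums the small values
     * first keeps their contribution and prints a different total. */
    s += big;
    for (int i = 0; i < 1000; i++)
        s += small;
    s -= big;
    printf("s = %f\n", s);
    return 0;
}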

Page 9:

XL C/C++ for Linux (LE) Community Edition, GA June 17, 2016


- Free of charge
- Unlimited license for production use
- Available for download via IBM developerWorks or the public apt-get, yum, and zypper repositories
- Includes all the core features of XL C/C++ on Linux

Page 10:

Summary  of  XL  Compilers  for  OpenPOWER  


Easy migration
  • C/C++ language standard conformance
  • Full binary compatibility with GCC
  • Option and source compatibility with GCC and Clang

Industry-leading performance
  • Full enablement and exploitation of the latest Power hardware
  • Advanced compiler optimization technologies
  • Optimized math libraries
  • 10-30% better than open-source compilers for typical workloads

World-class service and support

Page 11:

POWER8  Performance  figures  

- IBM S824L: 2 x 10 cores @ 3.69 GHz

[Charts: memory bandwidth (GB/s) vs. multi-threading mode (SMT1 to SMT8) for working sets from 3 x 37 MB up to 3 x 1.5 GB. Off-chip memory (256 GB Centaur): up to 3x over Intel E5-2650 v3. On-chip L3 cache (160 MB total).]

Page 12:

Computational throughput

- IBM ESSL 5.3 multi-threaded: SGEMM and DGEMM

[Chart: computational throughput (GFLOPS) vs. multi-threading mode (SMT1 to SMT8) for single and double precision.]

vs. MKL on E5-2650 v3 Haswell (with FMA): 1280 (0.72x) single, 523 (0.96x) double GFLOPS

Page 13:

Black-Scholes 1D
- Compute-intensive financial benchmark (Altivec)

[Figure: execution time (ms) vs. SMT mode (SMT1 to SMT8), single-precision and double-precision panels, for explicit1 scalar, explicit1 vector, explicit2 vector, implicit1 scalar, and implicit1 vector.]

Figure 3: One-factor hand-vectorised Black-Scholes performance on the Power8. The timings are for 50000 (explicit) or 2500 (implicit) timesteps, for 6144 options each on a grid of size 256.

Table 2: One-factor hand-vectorised Black-Scholes performance on the Power8 and Intel. The timings are for 50000 (explicit) or 2500 (implicit) timesteps, for 6144 options each on a grid of size 256.

                    POWER8                        Intel E5-2650 v3
                    single prec.   double prec.   single prec.   double prec.
                    msec GFlop/s   msec GFlop/s   msec GFlop/s   msec GFlop/s
explicit1 scalar    1782   264     3679   127      942   499     1593   293
explicit1 vector     821   573     1395   337     1512   311     2805   167
explicit2 vector    1131   415     2412   195      756   620     1359   346
implicit1 scalar     946    82     1038    65     1191    65     1524    44
implicit1 vector     205   380      406   166      288   270      756    89

with u_{-1} = u_J = 0. Here n is the timestep index, which increases as the computation progresses, so u_j^{n+1} is a simple combination of the values at the nearest neighbours at the previous timestep. All of the numerical results are for a grid of size J = 256, which corresponds to a fairly high-resolution grid in financial applications. Additional parallelism is achieved by solving multiple one-factor problems at the same time, with each one having different model constants or a different financial option payoff.

A standard implicit time-marching approximation leads to a tridiagonal set of equations of the form

a_j u_{j-1}^{n+1} + b_j u_j^{n+1} + c_j u_{j+1}^{n+1} = u_j^n,   j = 0, 1, ..., J-1

with u_{-1} = u_J = 0. This is then solved using the Thomas algorithm, a sequential algorithm for each option at each timestep.

The baseline implementations for both explicit and implicit algorithms have an outermost loop over all options and then, after initialising the a, b, and c coefficient arrays, iterate for a number of timesteps (50000 explicit, 2500 implicit), updating the values of all 256 grid points; this is the explicit1/implicit1 implementation. Various compilers may or may not vectorise over either the outer loop, in which case different lanes of the vector represent different options, or, in the explicit case, over the finite-difference update of grid points, with each lane representing a different grid point. Hand-written vectorisations are created for all of these, using vector intrinsics on Intel and the Altivec types on the Power8. Scalar and vectorised versions (over different options) are named explicit1 and implicit1; explicit vectorisation over a single option is named explicit2.
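For readers following along, here is a schematic of the explicit1 update described above, written so that the option index is the innermost, contiguous dimension and vector lanes map to different options. This is not the authors' code; the array names and coefficient values are illustrative, while J = 256 and the 6144 simultaneous options follow the text.

/* Schematic of the explicit one-factor update, vectorised over options. */
#include <stdio.h>

#define J    256    /* grid points per option (as in the text) */
#define NOPT 6144   /* options solved simultaneously (as in the text) */

/* rows 0 and J+1 hold the zero boundary values u_{-1} and u_J */
static float u[J + 2][NOPT], unew[J + 2][NOPT];
static float cm1[NOPT], c0[NOPT], cp1[NOPT];   /* per-option model constants */

void explicit_step(void)
{
    for (int j = 1; j <= J; j++)           /* grid points */
        for (int o = 0; o < NOPT; o++)     /* options: unit stride, one per SIMD lane */
            unew[j][o] = cm1[o] * u[j - 1][o]
                       + c0[o]  * u[j][o]
                       + cp1[o] * u[j + 1][o];
}

int main(void)
{
    for (int o = 0; o < NOPT; o++) { cm1[o] = 0.25f; c0[o] = 0.5f; cp1[o] = 0.25f; }
    for (int j = 1; j <= J; j++)
        for (int o = 0; o < NOPT; o++)
            u[j][o] = 1.0f;
    explicit_step();
    printf("unew[1][0] = %f\n", unew[1][0]);   /* 0.75: the zero boundary is felt */
    return 0;
}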

Figure 3 shows the performance of the one-factor benchmarks in single precision (left figure) and double precision (right figure); both scalar and hand-vectorised versions are shown, the latter using Altivec instructions. These are primarily compute-bound benchmarks: as the metrics in Table 2 show, around 62-67% of practical peak compute (930 GFLOPS in single, 500 GFLOPS in double) is achieved with the hand-vectorised explicit1 benchmarks. Hand-vectorised code gives significant improvements in all cases, up to four times in both single and double precision. The performance difference between single and double precision for both implicit and explicit hand-vectorised versions is close to 2x. In the implicit case, a reciprocal value has to be computed at each grid point during the Thomas algorithm; here the fast approximate reciprocal (vec_re) is used, which gives a 15% improvement in overall performance.

In comparison, on the Intel Xeon E5-2650 v3 (Haswell) the Intel compiler does auto-vectorise explicit1, although it is less performant than the hand-vectorised variant. Overall, it achieves about 50% of the benchmarked peak, slightly outperforming the POWER8 on the explicit benchmark, but it is significantly slower on the implicit benchmark.

5.2 Three-factor

The three-factor test application uses the Black-Scholes PDE for 3 underlying assets, each corresponding to geometric Brownian motion and with positive correlation between the 3 driving Brownian motions. This leads to a parabolic PDE with spatial cross-derivative terms with positive coefficients. The spatial approximation of this leads to a 13-point stencil involving offsets ±(1, 0, 0), ±(0, 1, 0), ±(0, 0, 1), ±(1, 1, 0), ±(0, 1, 1) and ±(1, 0, 1), relative to a point with 3D indices (i, j, k). The test case uses a grid of size 256^3, with all data stored in main memory in 1D arrays.

On-slide callouts: 0.92x, 0.97x, 1.4x, 1.8x

Page 14:

CloverLeaf  3D  –  structured  mesh  PDE  solver  

- Mantevo Suite, primarily bandwidth limited

[Chart: CloverLeaf 2D (3840^2, 87 iterations) runtime (s) with different compilers and implementations (ref XL f kern, ref XL c kern, OPS XL, OPS Clang) across 20/40/80/160 MPI ranks and 20 MPI ranks with 1/2/4/8 OpenMP threads.]

Average 160 GB/s, high 270 GB/s, low 70 GB/s
Intel E5-2650 v3: 32 sec (3x); NVIDIA K40: 15 sec (1.4x)

Page 15:

Airfoil  –  Unstructured  mesh  FVM  

[Charts: execution time (s) vs. SMT mode (SMT1 to SMT8), double- and single-precision panels, for MPI full, MPI comm, 4 MPI + OpenMP, 4 MPI + OpenMP comm, 8 MPI + OpenMP, and 8 MPI + OpenMP comm.]

Irregular computations are the bottleneck
Intel E5-2650 v3: 23.38/30.27 sec (1.5x); NVIDIA K40: 10.5/17.6 sec (0.8x)

Table 5: Useful bandwidth (BW, GB/s) and computational (Comp, GFlops/s) throughput of baseline implementations on Airfoil

            POWER8                                      Intel E5-2650 v3
Kernel      Double precision      Single precision      Double precision   Single precision
            Time   BW    Spdup    Time   BW    Spdup    Time    BW         Time    BW
save soln   0.61   301   4.7x     0.14   641   10.2x    2.86    64         1.44    64
adt calc    8.39   38    0.7x     7.49   22    0.8x     5.89    55         6.44    25
res calc    8.94   77    1.3x     8.29   42    1.3x     11.92   58         10.95   32
bres calc   0.02   133   2.5x     0.01   57    3x       0.05    53         0.03    27
update      4.55   172   2.1x     3.39   115   1.3x     9.52    82         4.49    87


Page 16:

Rolls-Royce Hydra (unstructured mesh)

[Charts: Hydra runtime (s) for 20/40/80/160 MPI ranks and 20 MPI ranks with 1/2/4/8 OpenMP threads, split into total time and MPI time; and Hydra runtime on the Haswell CPU (Intel E5-2650 v3), POWER8, and NVIDIA K80 (CUDA), with callouts of 1.4x and 1.8x.]

Page 17:


Two tracks to challenge and win:
1. The Open Road Test
   - Port and optimize for OpenPOWER
   - Go faster with accelerators (optional)
2. The Spark Rally
   - Train an accelerated DNN and recognize objects with greater accuracy
   - Show you can scale with Spark

Key dates
- Register today: openpower.devpost.com
- Sun May 1st: submission period opens
- Tue Aug 2nd: submission period closes

Grand prizes include a trip to Supercomputing 2016; other prizes include iPads and Apple Watches.

Join the conversation at #OpenPOWERSummit

Page 18:

Page 19:

Page 20:

Additional Information
- XL C/C++ home page: http://www-142.ibm.com/software/products/us/en/ccompfami/
- C/C++ Café: http://ibm.biz/Bdx8XR
- XL Fortran home page: http://www-03.ibm.com/software/products/en/fortcompfami
- Fortran Café: http://ibm.biz/Bdx8XX
- IBM SDK for Linux, using the CPI plugin: http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpibm.htm
- PMU events: http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpievents.htm


Page 21:

Additional Information
- Code optimization with the IBM XL compilers on Power architectures: http://www-01.ibm.com/support/docview.wss?uid=swg27005174&aid=1
- Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8: http://www.redbooks.ibm.com/abstracts/sg248171.html
- Implementing an IBM High-Performance Computing Solution on IBM POWER8: http://www.redbooks.ibm.com/abstracts/sg248263.html?Open


Page 22:

Installing XL C/C++ and Fortran Compilers
- Get the XL C/C++ and Fortran compilers
  C/C++:  http://www.ibm.com/developerworks/downloads/r/xlcpluslinux
  Fortran: http://www.ibm.com/developerworks/downloads/r/xlfortranlinux
- Run the install utility, which performs the following tasks:
  • Detects the current architecture (big endian or little endian)
  • Installs all prerequisite software packages (using apt-get, zypper, or yum)
  • Installs all compiler packages into the default location, /opt/ibm/
  • Automatically invokes the xlc_configure utility, which installs the license file and generates the default configuration file
  • Creates symbolic links in /usr/bin/ to the compiler invocation commands
- To use XL C/C++ with the Advance Toolchain, run xlc_configure -at to create xlc_at and xlC_at
- Free download of the XL compiler libraries: http://public.dhe.ibm.com/software/server/POWER/Linux/rte/xlcpp/le/


Page 23:

Compiling an Application with XLC and XLF
- Set up the path for the IBM XL compilers:
  export PATH=/opt/ibm/xlC/13.1.3/bin:/opt/ibm/xlf/15.1.3/bin:$PATH
- Check the compiler release and version:
  xlc -qversion
  xlC -qversion
  xlf -qversion
- Compile an application (see the example after this list):
  xlc for C (add -qlanglvl=stdc11 or -qlanglvl=extc1x to enable C11)
  xlC for C++ (add -qlanglvl=extended0x to enable C++11)
  xlf, xlf90, xlf95, xlf2003, xlf2008 for Fortran
- Specify compile options:
  • -O3 or -O3 -qhot for floating-point computation-intensive applications
  • -O3 or -O3 -qipa for integer applications
  • -qsmp=omp for OpenMP applications

Sample output:
  xlc -qversion
  IBM XL C/C++ for Linux, V13.1.3
  Version: 13.01.0003.0000

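A hedged end-to-end example: the file name, the source, and the assumed command in the comment are illustrative. It uses a C11 feature, so the -qlanglvl option above is required.

/* hello11.c: illustrative only. Assumed build, per the options above:
 *   xlc -O3 -qlanglvl=extc1x hello11.c -o hello11
 * Check the compiler first with: xlc -qversion
 */
#include <stdio.h>

/* C11 static assertion: rejected unless a C11 language level is enabled */
_Static_assert(sizeof(long) == 8, "expecting 64-bit long on Linux on Power");

int main(void)
{
    printf("built with a C11-enabled XL compiler\n");
    return 0;
}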

Page 24:

GPU Programming with CUDA C/C++ and XL C/C++
- Check for a CUDA-capable GPU:
  lspci | grep -i nvidia
- Verify the Linux version:
  uname -m && cat /etc/*release
- Get and install the CUDA toolkit and the XL C/C++ and Fortran compilers:
  CUDA C/C++: https://developer.nvidia.com/cuda-downloads-power8
  XL C/C++:   http://www.ibm.com/developerworks/downloads/r/xlcpluslinux
- Modify host_config.h for XL C/C++: change "!=" to ">=" and remove the irrelevant comment
- Set up the path for the IBM XL compilers:
  export PATH=/opt/ibm/xlC/13.1.3/bin:$PATH
- Environment setup:
  export PATH=/usr/local/cuda-7.0/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH
- Compile and run (a smaller sketch follows this list):
  nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -Xcompiler -qsmp=omp -gencode arch=compute_20,code=sm_20 -o cudaOpenMP.o -c cudaOpenMP.cu
  nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -o cudaOpenMP cudaOpenMP.o -lxlsmp

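The commands above build NVIDIA's cudaOpenMP sample. A much smaller sketch of the same idea, OpenMP host threads each launching their own kernel, is shown below; it is not the sample itself, and the array size, thread count, and assumed build line are illustrative.

// Minimal sketch: each OpenMP host thread launches a kernel that tags
// its portion of a device array. Assumed build, following the slide:
//   nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -qsmp=omp \
//        -o omp_kernels omp_kernels.cu -lxlsmp
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void fill(int *data, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    // One OpenMP thread per chunk; each enqueues its own kernel launch.
    #pragma omp parallel num_threads(4)
    {
        int t = omp_get_thread_num();
        int chunk = n / 4;
        fill<<<(chunk + 255) / 256, 256>>>(d_data + t * chunk, t, chunk);
    }
    cudaDeviceSynchronize();

    int first;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("first element = %d (expect 0)\n", first);
    cudaFree(d_data);
    return 0;
}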

Page 25:

SPEC CPU2006 Rate Performance

[Charts: SPECint_rate2006 and SPECfp_rate2006 for POWER S824 vs. Haswell E5-4655 v3.]

- POWER8 is up to 1.4x the top 24-core Haswell
- XLC V13.1 / XLF V15.1 are 10 to 30% better than GCC 4.9

Page 26:

STAC-A2 Performance

[Chart: relative performance (speedup) on STAC-A2.β2.GREEKS.TIME.WARM and STAC-A2.β2.GREEKS.MAX_PATHS; POWER8 24c/192t at 2.3x and 2.1x, vs. E5-2699 v3 36c/72t and E7-4890 v2 60c/120t at 1.0x.]

https://stacresearch.com/news/2015/03/12/stac-report-stac-a2-ibm-power8-solution

- POWER8 delivers double the performance of Haswell
- XLC 13.1 is 10% to 20% better than GCC 4.9

Page 27:

NAS  Performance  

[Chart: speedup of XL 13.1 and XLF 15.1 over GCC 5.2 on the NAS benchmarks bt, cg, dc, ep, ft, is, lu, mg, sp, and ua.]

- XLC 13 / XLF 15 performs 1.36x better than GCC 5.2