CUDA Programming and Performance Tuning with IBM XL Compilers on POWER8 Systems


Page 1:

Revolutionizing the Datacenter

Join the Conversation #OpenPOWERSummit

CUDA  Programming  and  Performance  Tuning  with  IBM  XL  Compilers  on  POWER8  Systems  

Shereen  Ghobrial,  Yaoqing  Gao  IBM  Canada  Lab  

 


Istvan  Reguly  PPCU  ITK,  Hungary  

Page 2:

Agenda  

- IBM XL Compiler Overview
- Performance Tuning with XL Compilers
- Performance results
- Summary


Page 3:

Overview  of  XL  C/C++  and  Fortran  Compilers  


[Diagram: XL C++ and XL Fortran platform coverage across Power (Linux BE and LE, AIX, BG/Q) and z Systems (Linux, z/OS).]

Page 4:

Advanced Optimization Technology
• Full platform exploitation: enable and exploit POWER hardware features
• Loop transformation: analyze and transform loops to improve performance
• Automatic SIMDization/vectorization: convert operations so that several calculations occur simultaneously
• Parallelization: automatic parallelization and explicit parallelization through OpenMP (see the sketch after this list)
• Optimized math libraries: scalar MASS library and vector MASSV library tuned for POWER
• IPA (inter-procedural analysis): apply optimization techniques to entire programs
• PDF (profile-directed feedback): tune application performance for typical usage scenarios

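A minimal sketch of the kind of loop these features target (illustrative only; the file name and the exact flag combination in the comment are assumptions composed from the options named on this slide):

/* Illustrative sketch: a loop the XL optimizer can SIMDize and, with
 * -qsmp=omp, also parallelize across cores. Assumed build command:
 *   xlc -O3 -qhot -qsmp=omp saxpy.c -o saxpy
 */
#include <stdio.h>

#define N 1048576

static float x[N], y[N];

/* y = a*x + y: unit stride and no loop-carried dependence, so the compiler
 * can convert the body to vector operations (SIMDization) and split the
 * iteration space across OpenMP threads. */
void saxpy(float a)
{
#pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}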

Page 5:

Language standard compliance
- C99 compliance; C11 compliance on Linux on Power, and selected features on other platforms
- C++98 compliance; C++11 compliance on Linux on Power, and selected features on other platforms
- Fortran 2003 compliance; selected Fortran 2008 and TS 29113 features
- OpenMP 3.1 compliance; selected OpenMP 4.0 features


Page 6:

Open  Source  &  GCC  affinity  

- Binary compatibility (see the sketch below)

[Diagram: XL C++ and gcc C++ objects calling each other; Makefile reuse; migration path from gcc to XL.]

- Options and source compatibility
- IBM Advance Toolchain support
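A hedged illustration of the binary-compatibility point, not taken from the deck: the file names, the command lines, and the sum() helper are assumptions; the idea is simply that an object built by g++ and an object built by xlC follow the same ABI on Linux on Power, so they link and call each other.

/* Sketch of GCC/XL binary compatibility. Assumed build:
 *   g++ -c util.cpp            (GCC-built object)
 *   xlC -O3 -c main.cpp        (XL-built object)
 *   xlC main.o util.o -o app   (link the mix)
 * The whole listing also compiles as a single file, for convenience.
 */

/* --- util.cpp (compiled with g++) --- */
#include <vector>
#include <numeric>
double sum(const std::vector<double> &v)
{
    return std::accumulate(v.begin(), v.end(), 0.0);
}

/* --- main.cpp (compiled with xlC) --- */
#include <cstdio>
#include <vector>
double sum(const std::vector<double> &v);   /* defined in the g++-built object */
int main()
{
    std::vector<double> v(100, 0.5);
    std::printf("sum = %f\n", sum(v));      /* expect 50.0 */
    return 0;
}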

Page 7:

Performance  Tuning  with  XL  compilers  and  Libraries  

- Identify application hot spots and performance bottlenecks
  • Performance tools: perf, OProfile, gprof, etc.
- Use the XL compiler report to examine compiler optimization:
  -qlistfmt=[xml | html]=inlines    generates inlining information
  -qlistfmt=[xml | html]=transform  generates loop transformation information
  -qlistfmt=[xml | html]=data       generates data reorganization information
  -qlistfmt=[xml | html]=pdf        generates dynamic profiling information
  -qlistfmt=[xml | html]=all        turns on all optimization content
  -qlistfmt=[xml | html]=none       turns off all optimization content
- Tune the performance with XL compilers (example after this list):
  • Typically start from -O2 or -O3
  • Add high-order optimization -qhot for floating-point computation and memory-bandwidth-intensive workloads
  • Add whole-program optimization -qipa[=level=0 | 1 | 2] for workloads with many small C/C++ function calls
  • Add -qsmp=omp for OpenMP workloads
  • Use CUDA C/C++ and Fortran for GPU-enabled workloads

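A hedged sketch of how that workflow might look on a toy hot spot; the file name, the matvec/fma1 helpers, and the command sequence in the comment are illustrative assumptions composed only from the options listed above.

/* Illustrative tuning target. Assumed progression:
 *   step 1: xlc -O3 hot.c -o hot
 *   step 2: xlc -O3 -qhot hot.c -o hot               (loop/FP-intensive code)
 *   step 3: xlc -O3 -qhot -qipa=level=2 hot.c -o hot (many small calls)
 *   report: add -qlistfmt=html=all to any step to inspect what was
 *           inlined and which loops were transformed.
 */
#include <stdio.h>
#define N 1024

static double a[N][N], x[N], y[N];

/* Small function called N*N times: a candidate for inlining (reported by
 * -qlistfmt=...=inlines) and, once inlined, for loop transformation
 * (reported by -qlistfmt=...=transform). */
static double fma1(double p, double q, double r) { return p * q + r; }

void matvec(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] = fma1(a[i][j], x[j], y[i]);
}

int main(void)
{
    for (int i = 0; i < N; i++) {
        x[i] = 1.0; y[i] = 0.0;
        for (int j = 0; j < N; j++) a[i][j] = 1.0;
    }
    matvec();
    printf("y[0] = %f\n", y[0]);   /* expect 1024.0 */
    return 0;
}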

Page 8:

Performance  Tuning  with  XL  compilers  and  Libraries  

- Consider using the highly tuned MASS/MASSV and ESSL libraries
- Aggressive optimization may affect the results of the program (see the sketch below). -qstrict guarantees results identical to noopt, at the expense of optimization; suboptions allow fine-grained control over this guarantee. Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict general and computation of NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations

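As an illustration (not from the slides), the toy reduction below is the kind of code whose result can change when the optimizer re-associates floating-point operations; -qstrict, or just -qstrict=order, preserves the original evaluation order.

/* Illustrative only: summing values of very different magnitude.
 * At -O3 the optimizer may re-associate or vectorize this reduction with
 * partial sums, changing rounding; -qstrict (or -qstrict=order) keeps the
 * original left-to-right evaluation order.
 */
#include <stdio.h>

int main(void)
{
    float big = 1.0e8f, small = 1.0f;
    float s = 0.0f;
    /* Left to right: each 'small' is absorbed into 'big' and lost, so the
     * final value is 0. A re-associated version that sums the small values
     * first keeps their contribution and prints a different total. */
    s += big;
    for (int i = 0; i < 1000; i++)
        s += small;
    s -= big;
    printf("s = %f\n", s);
    return 0;
}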

Page 9:

XL C/C++ for Linux (LE) Community Edition, GA June 17, 2016


- Free of charge
- Unlimited license for production use
- Available for download via IBM developerWorks or the public apt-get, yum, and zypper repositories
- Includes all the core features of XL C/C++ on Linux

Page 10:

Summary  of  XL  Compilers  for  OpenPOWER  


Easy migration
  • C/C++ language standard conformance
  • Full binary compatibility with GCC
  • Option and source compatibility with GCC and Clang

Industry-leading performance
  • Full enablement and exploitation of the latest Power hardware
  • Advanced compiler optimization technologies
  • Optimized math libraries
  • 10-30% better than open-source compilers for typical workloads

World-class service and support

Page 11:

POWER8  Performance  figures  

- IBM S824L: 2 x 10 cores @ 3.69 GHz

[Charts: memory bandwidth (GB/s) vs. multi-threading mode (SMT1 to SMT8) for working sets from 3 x 37 MB up to 3 x 1.5 GB. Off-chip memory (256 GB Centaur): up to 3x over Intel E5-2650 v3. On-chip L3 cache (160 MB total).]

Page 12:

Computational throughput

- IBM ESSL 5.3 multi-threaded: SGEMM and DGEMM

[Chart: computational throughput (GFLOPS) vs. multi-threading mode (SMT1 to SMT8) for single and double precision.]

vs. MKL on E5-2650 v3 Haswell (with FMA): 1280 (0.72x) single, 523 (0.96x) double GFLOPS

Page 13:

Black-Scholes 1D
- Compute-intensive financial benchmark (Altivec)

[Figure: execution time (ms) vs. SMT mode (SMT1 to SMT8), single-precision and double-precision panels, for explicit1 scalar, explicit1 vector, explicit2 vector, implicit1 scalar, and implicit1 vector.]

Figure 3: One-factor hand-vectorised Black-Scholes performance on the Power8. The timings are for 50000 (explicit) or 2500 (implicit) timesteps, for 6144 options each on a grid of size 256.

Table 2: One-factor hand-vectorised Black-Scholes performance on the Power8 and Intel. The timings are for 50000 (explicit) or 2500 (implicit) timesteps, for 6144 options each on a grid of size 256.

                    POWER8                        Intel E5-2650 v3
                    single prec.   double prec.   single prec.   double prec.
                    msec GFlop/s   msec GFlop/s   msec GFlop/s   msec GFlop/s
explicit1 scalar    1782   264     3679   127      942   499     1593   293
explicit1 vector     821   573     1395   337     1512   311     2805   167
explicit2 vector    1131   415     2412   195      756   620     1359   346
implicit1 scalar     946    82     1038    65     1191    65     1524    44
implicit1 vector     205   380      406   166      288   270      756    89

with u_{-1} = u_J = 0. Here n is the timestep index, which increases as the computation progresses, so u_j^{n+1} is a simple combination of the values at the nearest neighbours at the previous timestep. All of the numerical results are for a grid of size J = 256, which corresponds to a fairly high-resolution grid in financial applications. Additional parallelism is achieved by solving multiple one-factor problems at the same time, with each one having different model constants or a different financial option payoff.

A standard implicit time-marching approximation leads to a tridiagonal set of equations of the form

a_j u_{j-1}^{n+1} + b_j u_j^{n+1} + c_j u_{j+1}^{n+1} = u_j^n,   j = 0, 1, ..., J-1

with u_{-1} = u_J = 0. This is then solved using the Thomas algorithm, a sequential algorithm for each option at each timestep.

The baseline implementations for both explicit and implicit algorithms have an outermost loop over all options and then, after initialising the a, b, and c coefficient arrays, iterate for a number of timesteps (50000 explicit, 2500 implicit), updating the values of all 256 grid points; this is the explicit1/implicit1 implementation. Various compilers may or may not vectorise over either the outer loop, in which case different lanes of the vector represent different options, or, in the explicit case, over the finite-difference update of grid points, with each lane representing a different grid point. Hand-written vectorisations are created for all of these, using vector intrinsics on Intel and the Altivec types on the Power8. Scalar and vectorised versions (over different options) are named explicit1 and implicit1; explicit vectorisation over a single option is named explicit2.
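For readers following along, here is a schematic of the explicit1 update described above, written so that the option index is the innermost, contiguous dimension and vector lanes map to different options. This is not the authors' code; the array names and coefficient values are illustrative, while J = 256 and the 6144 simultaneous options follow the text.

/* Schematic of the explicit one-factor update, vectorised over options. */
#include <stdio.h>

#define J    256    /* grid points per option (as in the text) */
#define NOPT 6144   /* options solved simultaneously (as in the text) */

/* rows 0 and J+1 hold the zero boundary values u_{-1} and u_J */
static float u[J + 2][NOPT], unew[J + 2][NOPT];
static float cm1[NOPT], c0[NOPT], cp1[NOPT];   /* per-option model constants */

void explicit_step(void)
{
    for (int j = 1; j <= J; j++)           /* grid points */
        for (int o = 0; o < NOPT; o++)     /* options: unit stride, one per SIMD lane */
            unew[j][o] = cm1[o] * u[j - 1][o]
                       + c0[o]  * u[j][o]
                       + cp1[o] * u[j + 1][o];
}

int main(void)
{
    for (int o = 0; o < NOPT; o++) { cm1[o] = 0.25f; c0[o] = 0.5f; cp1[o] = 0.25f; }
    for (int j = 1; j <= J; j++)
        for (int o = 0; o < NOPT; o++)
            u[j][o] = 1.0f;
    explicit_step();
    printf("unew[1][0] = %f\n", unew[1][0]);   /* 0.75: the zero boundary is felt */
    return 0;
}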

Figure 3 shows the performance of the one-factor benchmarks in single precision (left figure) and double precision (right figure); both scalar and hand-vectorised versions are shown, the latter using Altivec instructions. These are primarily compute-bound benchmarks: as the metrics in Table 2 show, around 62-67% of practical peak compute (930 GFLOPS in single, 500 GFLOPS in double) is achieved with the hand-vectorised explicit1 benchmarks. Hand-vectorised code gives significant improvements in all cases, up to four times in both single and double precision. The performance difference between single and double precision for both implicit and explicit hand-vectorised versions is close to 2x. In the implicit case, a reciprocal value has to be computed at each grid point during the Thomas algorithm; here the fast approximate reciprocal (vec_re) is used, which gives a 15% improvement in overall performance.

In comparison, on the Intel Xeon E5-2650 v3 (Haswell) the Intel compiler does auto-vectorise explicit1, although it is less performant than the hand-vectorised variant. Overall, it achieves about 50% of the benchmarked peak, slightly outperforming the POWER8 on the explicit benchmark, but it is significantly slower on the implicit benchmark.

5.2 Three-factor

The three-factor test application uses the Black-Scholes PDE for 3 underlying assets, each corresponding to geometric Brownian motion and with positive correlation between the 3 driving Brownian motions. This leads to a parabolic PDE with spatial cross-derivative terms with positive coefficients. The spatial approximation of this leads to a 13-point stencil involving offsets ±(1, 0, 0), ±(0, 1, 0), ±(0, 0, 1), ±(1, 1, 0), ±(0, 1, 1) and ±(1, 0, 1), relative to a point with 3D indices (i, j, k). The test case uses a grid of size 256^3, with all data stored in main memory in 1D arrays.

On-slide callouts: 0.92x, 0.97x, 1.4x, 1.8x

Page 14:

CloverLeaf  3D  –  structured  mesh  PDE  solver  

- Mantevo Suite, primarily bandwidth limited

[Chart: CloverLeaf 2D (3840^2, 87 iterations) runtime (s) with different compilers and implementations (ref XL f kern, ref XL c kern, OPS XL, OPS Clang) across 20/40/80/160 MPI ranks and 20 MPI ranks with 1/2/4/8 OpenMP threads.]

Average 160 GB/s, high 270 GB/s, low 70 GB/s
Intel E5-2650 v3: 32 sec (3x); NVIDIA K40: 15 sec (1.4x)

Page 15:

Airfoil  –  Unstructured  mesh  FVM  

[Charts: execution time (s) vs. SMT mode (SMT1 to SMT8), double- and single-precision panels, for MPI full, MPI comm, 4 MPI + OpenMP, 4 MPI + OpenMP comm, 8 MPI + OpenMP, and 8 MPI + OpenMP comm.]

Irregular computations are the bottleneck
Intel E5-2650 v3: 23.38/30.27 sec (1.5x); NVIDIA K40: 10.5/17.6 sec (0.8x)

Table 5: Useful bandwidth (BW, GB/s) and computational (Comp, GFlops/s) throughput of baseline implementations on Airfoil

            POWER8                                      Intel E5-2650 v3
Kernel      Double precision      Single precision      Double precision   Single precision
            Time   BW    Spdup    Time   BW    Spdup    Time    BW         Time    BW
save soln   0.61   301   4.7x     0.14   641   10.2x    2.86    64         1.44    64
adt calc    8.39   38    0.7x     7.49   22    0.8x     5.89    55         6.44    25
res calc    8.94   77    1.3x     8.29   42    1.3x     11.92   58         10.95   32
bres calc   0.02   133   2.5x     0.01   57    3x       0.05    53         0.03    27
update      4.55   172   2.1x     3.39   115   1.3x     9.52    82         4.49    87


Page 16:

Rolls-Royce Hydra (unstructured mesh)

[Charts: Hydra runtime (s) for 20/40/80/160 MPI ranks and 20 MPI ranks with 1/2/4/8 OpenMP threads, split into total time and MPI time; and Hydra runtime on the Haswell CPU (Intel E5-2650 v3), POWER8, and NVIDIA K80 (CUDA), with callouts of 1.4x and 1.8x.]

Page 17:


Two tracks to challenge and win:
1. The Open Road Test
   - Port and optimize for OpenPOWER
   - Go faster with accelerators (optional)
2. The Spark Rally
   - Train an accelerated DNN and recognize objects with greater accuracy
   - Show you can scale with Spark

Key dates
- Register today: openpower.devpost.com
- Sun May 1st: submission period opens
- Tue Aug 2nd: submission period closes

Grand prizes include a trip to Supercomputing 2016; other prizes include iPads and Apple Watches.

Join the conversation at #OpenPOWERSummit

Page 18:

Page 19:

Page 20:

Additional Information
- XL C/C++ home page: http://www-142.ibm.com/software/products/us/en/ccompfami/
- C/C++ Café: http://ibm.biz/Bdx8XR
- XL Fortran home page: http://www-03.ibm.com/software/products/en/fortcompfami
- Fortran Café: http://ibm.biz/Bdx8XX
- IBM SDK for Linux, using the CPI plugin: http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpibm.htm
- PMU events: http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpievents.htm


Page 21:

Additional Information
- Code optimization with the IBM XL compilers on Power architectures: http://www-01.ibm.com/support/docview.wss?uid=swg27005174&aid=1
- Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8: http://www.redbooks.ibm.com/abstracts/sg248171.html
- Implementing an IBM High-Performance Computing Solution on IBM POWER8: http://www.redbooks.ibm.com/abstracts/sg248263.html?Open


Page 22:

Installing XL C/C++ and Fortran Compilers
- Get the XL C/C++ and Fortran compilers
  C/C++:  http://www.ibm.com/developerworks/downloads/r/xlcpluslinux
  Fortran: http://www.ibm.com/developerworks/downloads/r/xlfortranlinux
- Run the install utility, which performs the following tasks:
  • Detects the current architecture (big endian or little endian)
  • Installs all prerequisite software packages (using apt-get, zypper, or yum)
  • Installs all compiler packages into the default location, /opt/ibm/
  • Automatically invokes the xlc_configure utility, which installs the license file and generates the default configuration file
  • Creates symbolic links in /usr/bin/ to the compiler invocation commands
- To use XL C/C++ with the Advance Toolchain, run xlc_configure -at to create xlc_at and xlC_at
- Free download of the XL compiler libraries: http://public.dhe.ibm.com/software/server/POWER/Linux/rte/xlcpp/le/


Page 23:

Compiling an Application with XLC and XLF
- Set up the path for the IBM XL compilers:
  export PATH=/opt/ibm/xlC/13.1.3/bin:/opt/ibm/xlf/15.1.3/bin:$PATH
- Check the compiler release and version:
  xlc -qversion
  xlC -qversion
  xlf -qversion
- Compile an application (see the example after this list):
  xlc for C (add -qlanglvl=stdc11 or -qlanglvl=extc1x to enable C11)
  xlC for C++ (add -qlanglvl=extended0x to enable C++11)
  xlf, xlf90, xlf95, xlf2003, xlf2008 for Fortran
- Specify compile options:
  • -O3 or -O3 -qhot for floating-point computation-intensive applications
  • -O3 or -O3 -qipa for integer applications
  • -qsmp=omp for OpenMP applications

Sample output:
  xlc -qversion
  IBM XL C/C++ for Linux, V13.1.3
  Version: 13.01.0003.0000

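A hedged end-to-end example: the file name, the source, and the assumed command in the comment are illustrative. It uses a C11 feature, so the -qlanglvl option above is required.

/* hello11.c: illustrative only. Assumed build, per the options above:
 *   xlc -O3 -qlanglvl=extc1x hello11.c -o hello11
 * Check the compiler first with: xlc -qversion
 */
#include <stdio.h>

/* C11 static assertion: rejected unless a C11 language level is enabled */
_Static_assert(sizeof(long) == 8, "expecting 64-bit long on Linux on Power");

int main(void)
{
    printf("built with a C11-enabled XL compiler\n");
    return 0;
}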

Page 24:

GPU Programming with CUDA C/C++ and XL C/C++
- Check for a CUDA-capable GPU:
  lspci | grep -i nvidia
- Verify the Linux version:
  uname -m && cat /etc/*release
- Get and install the CUDA toolkit and the XL C/C++ and Fortran compilers:
  CUDA C/C++: https://developer.nvidia.com/cuda-downloads-power8
  XL C/C++:   http://www.ibm.com/developerworks/downloads/r/xlcpluslinux
- Modify host_config.h for XL C/C++: change "!=" to ">=" and remove the irrelevant comment
- Set up the path for the IBM XL compilers:
  export PATH=/opt/ibm/xlC/13.1.3/bin:$PATH
- Environment setup:
  export PATH=/usr/local/cuda-7.0/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH
- Compile and run (a smaller sketch follows this list):
  nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -Xcompiler -qsmp=omp -gencode arch=compute_20,code=sm_20 -o cudaOpenMP.o -c cudaOpenMP.cu
  nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -o cudaOpenMP cudaOpenMP.o -lxlsmp

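The commands above build NVIDIA's cudaOpenMP sample. A much smaller sketch of the same idea, OpenMP host threads each launching their own kernel, is shown below; it is not the sample itself, and the array size, thread count, and assumed build line are illustrative.

// Minimal sketch: each OpenMP host thread launches a kernel that tags
// its portion of a device array. Assumed build, following the slide:
//   nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -qsmp=omp \
//        -o omp_kernels omp_kernels.cu -lxlsmp
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void fill(int *data, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    // One OpenMP thread per chunk; each enqueues its own kernel launch.
    #pragma omp parallel num_threads(4)
    {
        int t = omp_get_thread_num();
        int chunk = n / 4;
        fill<<<(chunk + 255) / 256, 256>>>(d_data + t * chunk, t, chunk);
    }
    cudaDeviceSynchronize();

    int first;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("first element = %d (expect 0)\n", first);
    cudaFree(d_data);
    return 0;
}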

Page 25:

SPEC CPU2006 Rate Performance

[Charts: SPECint_rate2006 and SPECfp_rate2006 for POWER S824 vs. Haswell E5-4655 v3.]

- POWER8 is up to 1.4x the top 24-core Haswell
- XLC V13.1 / XLF V15.1 are 10 to 30% better than GCC 4.9

Page 26:

STAC-A2 Performance

[Chart: relative performance (speedup) on STAC-A2.β2.GREEKS.TIME.WARM and STAC-A2.β2.GREEKS.MAX_PATHS; POWER8 24c/192t at 2.3x and 2.1x, vs. E5-2699 v3 36c/72t and E7-4890 v2 60c/120t at 1.0x.]

https://stacresearch.com/news/2015/03/12/stac-report-stac-a2-ibm-power8-solution

- POWER8 delivers double the performance of Haswell
- XLC 13.1 is 10% to 20% better than GCC 4.9

Page 27:

NAS  Performance  

[Chart: speedup of XL 13.1 and XLF 15.1 over GCC 5.2 on the NAS benchmarks bt, cg, dc, ep, ft, is, lu, mg, sp, and ua.]

- XLC 13 / XLF 15 performs 1.36x better than GCC 5.2