Srinath Vadlamani, Youngsung Kim, and John Dennis, ASAP-TDD-CISL
NCAR
A comparison of climate applications on accelerated and conventional architectures.
The overall picture of the acceleration effort and the different techniques should be understood. [Srinath V.] We use small investigative kernels to help teach us, and we use instrumentation tools to help us work with the larger code set.
The investigative DG_KERNEL shows what is possible if everything were simple. [Youngsung K.] DG_KERNEL helped us understand the hardware, and it helped us understand the coding practices and software instructions needed to achieve superb performance.
Presentation has two parts
ASAP personnel: Srinath Vadlamani, John Dennis, Youngsung Kim, Michael Arndt, Ben Jamroz, and Rich Loft
Active collaborators:
Intel: Michael Greenfield, Ruchira Sasanka, Sergey Egorov, Karthik Raman, and Mark Lubin
NREL: Ilene Carpenter
The Application and Scalability Performance (ASAP) team is researching modern micro-architectures for climate codes.
Climate simulations cover hundreds to thousands of simulated years.
Currently, high-resolution climate simulations run at 2~3 simulated years per day (SYPD) [~40k PEs].
GPUs and coprocessors can help to increase SYPD.
Having many collaborators mandates the use of many architectures.
We must use these architectures efficiently for a successful SYPD speed-up, which requires knowing the hardware!
Climate codes ALWAYS NEED A FASTER SYSTEM.
Conventional CPU based:
NCAR Yellowstone (Xeon: SNB) - CESM, HOMME
ORNL TITAN (AMD: Interlagos) - benchmark kernel
Xeon Phi based:
TACC Stampede - CESM
NCAR test system (SE10x changing to 7120) - HOMME
GPU based:
NCAR Caldera cluster (M2070Q) - HOMME
ORNL Titan (K20x) - HOMME
TACC Stampede (K20) - benchmark kernels only.
We have started the acceleration effort on specific platforms.
CESM is a large application, so we need to create benchmark kernels to understand the hardware. Smaller examples are easier to understand and manipulate.
The first two kernels we have created are DG_KERNEL from HOMME [detailed by Youngsung] and a standalone driver for WETDEPA_V2.
We can learn how to use accelerated hardware for climate codes by creating representative
examples.
We created DG_KERNEL knowing it could be a well-vectorized code (with help).
What if we want to start cherry picking subroutines and loops to try the learned techniques?
Instrumentation tools are available, with teams that are willing to support your efforts. Trace-based tools offer great detail. Profiling tools present summaries up front. A previous NCAR-SEA conference highlighted such tools.
Knowing what can be accelerated is half the battle.
The Extrae tracing tool was developed at the Barcelona Supercomputing Center (H. Servat, J. Labarta, J. Gimenez). The automatic performance identification process is a BSC research project. Extrae produces a time series of communication and hardware-counter events.
Paraver is the visualizer, which also performs statistical analysis. There are clustering techniques that use a folding concept plus the research identification process to create "synthetic" traces with fewer samples.
Extrae tracing can pick out problematic regions of a large code.
Clustering code regions with similarly poor computational characteristics is a good guide.
• Result of an Extrae trace of CESM on Yellowstone.
• Similar to exclusive execution time.
Extrae tracing exposed possible waste of cycles.
• Red: instruction count.
• Blue: d(INS)/dt
Paraver identified code regions.
• The trace identifies what code is active and when.
• We now examine code regions for characteristics amenable to acceleration.
Automatic Performance Identification highlighted these groups' subroutines.
Group A             Overall execution time %
  conden              2.7
  compute_usschu      3.3
  rtrnmc              1.75
Group B
  micro_gm_tend       1.36
  wetdepa_v2          2.5
Group C
  reftra_sw           1.71
  spcvmc_sw           1.21
  vrtqdr_sw           1.43
• Small number of lines of code
• Ready to be vectorized
The subroutine has sections of doubly nested loops. These loops are very long and contain branches. Compilers have trouble vectorizing loops that contain branches.
The restructuring started with breaking up the loops. We collected scalars into arrays for vector operations. We broke the very long expressions into smaller pieces (a schematic sketch is shown below).
wetdepa_v2 can be vectorized with recoding.
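A schematic Fortran sketch of the restructuring pattern (this is not the actual wetdepa_v2 source; the variable names below are hypothetical): the branch is hoisted into its own short loop, loop scalars become arrays, and the long expression is split into short, vectorizable pieces.

subroutine scavenge_sketch(ncol, cldv, solconst, precip, dt, scav)
   implicit none
   integer, intent(in)  :: ncol
   real(8), intent(in)  :: cldv(ncol), solconst(ncol), precip(ncol), dt
   real(8), intent(out) :: scav(ncol)
   real(8) :: frac(ncol), term1(ncol), term2(ncol)
   real(8), parameter :: small = 1.0e-36_8
   integer :: i

   ! 1) Hoist the branch into its own short loop and promote the scalar to an array.
   do i = 1, ncol
      frac(i) = max(cldv(i), small)
   end do

   ! 2) Break the long expression into short pieces the compiler can vectorize.
   do i = 1, ncol
      term1(i) = solconst(i)*frac(i)
      term2(i) = precip(i)*dt
      scav(i)  = term1(i) + term2(i)
   end do
end subroutine scavenge_sketch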
Hand modification of the code compares well with compiler optimization.
• Vectorizing? -vec-report=3,6
• Code optimized
• The modification touched only a small number of lines.
• -O3 fast for the original gave incorrect results.
The modified wetdepa_v2, placed back into CESM on SNB, shows better use of resources.
• 2.5% --> 0.7% of overall execution time in CESM on Yellowstone.
The CAM-SE configuration was profiled on Stampede at TACC using TAU, which provides different levels of introspection into subroutine and loop efficiency.
This process taught us more about hardware-counter metrics.
The initial investigation fits a core-count to core-count comparison.
Profilers are also useful tools for understanding code efficiency in the BIG code.
Hot spots can be associated with the largest exclusive execution time.
A long time may indicate a branchy section of code.
A long exclusive time on both devices is a good place to start looking.
Low VI marks a candidate for acceleration techniques (the KNC definition is noted below).
High VI could be misleading.
Note: The VI metric is defined differently on Sandybridge and Xeon Phi. http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
Possible speedup can be achieved with a gain in Vectorization Intensity (VI)
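For reference, on the Xeon Phi (KNC) the vectorization intensity used here is commonly computed from the VPU hardware counters (the Sandy Bridge definition differs; see the PAPI link above):

VI(KNC) = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED

with an ideal value of 8 for double precision (16 for single precision) on the 512-bit vector unit.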
CESM on KNC not competitive today.
Device (compiler options)                                       avg. time step [s]
Sandybridge, -O2                                                30.88
KNC, -O0                                                        1566.7
KNC, -O2 ("derivative_mod.F90" at -O1)                          660.48
KNC, -O2 ("derivative_mod.F90" at -O1, -align array64byte)      516.83

• FC5 ne16g37
• 16 MPI ranks/node
• 1 rank/core
• 8 nodes
• Single thread
Device          MPI ranks/device   Threads/MPI rank   Avg. 1-day dt [s]
Dual SNB        16                 1                  4.07
Dual SNB        1                  16                 6.74
Xeon Phi SE10   60                 2                  611.60
Xeon Phi SE10   1                  120                31.66
Xeon Phi SE10   1                  192                32.36
Xeon Phi SE10   2                  96                 24.07
Xeon Phi SE10   8                  24                 18.60
Xeon Phi SE10   16                 12                 19.74
Hybrid-parallelism is promising for CESM on KNC
• FCIDEAL ne16ne16
• Stampede: 8 nodes
• F03 use of allocatable derived-type components to overcome a threading issue [all -O2] (a hedged illustration follows)
• Intel compiler and IMPI
• KNC 4.6x slower
• Will get better with Xeon Phi tuning techniques
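The slides do not show the specific threading issue; purely as a hedged illustration of the F2003 feature mentioned above, allocatable components (here in a hypothetical type work_t) let each OpenMP thread allocate and own its own buffer:

program f03_alloc_component_sketch
   use omp_lib
   implicit none
   type :: work_t
      real(8), allocatable :: buf(:)     ! F2003 allocatable component
   end type work_t
   type(work_t), allocatable :: w(:)     ! one entry per thread
   integer :: tid

   allocate(w(0:omp_get_max_threads()-1))
   !$omp parallel private(tid)
   tid = omp_get_thread_num()
   allocate(w(tid)%buf(1000))            ! each thread allocates its own component
   w(tid)%buf = real(tid, 8)
   !$omp end parallel
   print *, 'thread 0 sum:', sum(w(0)%buf)
end program f03_alloc_component_sketch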
CESM is running on the TACC Stampede KNC cluster. We are becoming more familiar with the possibilities on GPUs and KNCs by using climate-code benchmark kernels.
Kernels are useful for discovering acceleration strategies and for hardware investigations. Results are promising.
We now have tracing- and profiling-tool knowledge to help identify acceleration possibilities within the large code base.
We have strategies for symmetric execution (host and coprocessor together), which is a very attractive mode of operation.
Though CESM is not competitive on a KNC cluster today, the kernel experience shows what is possible.
Part 1. Conclusion: We are hopeful to see speedup on accelerated hardware.
ASAP/TDD/CISL/NCAR
Youngsung Kim
Performance Tuning Techniques
for GPU and MIC
Introduction
Kernel-based approach
Micro-architectures
MIC performance evolutions
CUDA-C performance evolutions
CPU performance evolutions along with MIC evolutions
GPU programming: OpenACC, CUDA Fortran, and F2C-ACC
One-source consideration
Summary
Contents
What is a kernel? A small, computation-intensive part of an existing large code that represents the characteristics of its computations.
Benefits of the kernel-based approach:
Easy to manipulate and understand (CESM: >1.5M LOC).
Easy to convert to various programming technologies: CUDA-C, CUDA-Fortran, OpenACC, and F2C-ACC.
Easy to isolate issues for analysis; simplifies hardware counter analysis.
Motivation of kernel-based approach
Origin*: a kernel derived from the computational part of the gradient calculation in the Discontinuous Galerkin formulation of the shallow water equations from HOMME.
Implementation from HOMME: similar to the "dg3d_gradient_mass" function in "dg3d_core_mod.F90".
Calculates the gradient of the flux vectors and updates the flux vectors using the calculated gradient.
DG kernel
*: R. D. Nair, S. J. Thomas, and R. D. Loft: A discontinuous Galerkin global shallow water model, Monthly Weather Review, Vol. 133, pp. 876-888.
Floating-point operations: no dependency between elements; parallelism over the # of elements.
The operation count can be calculated analytically from the source code. Ex.) When nit=1000, nelem=1024, nx=4 (npts=nx*nx): ≈ 2 GFLOP.
OpenMP: two OpenMP parallel regions for the DO loops over the element index (ie); a sketch of the loops follows.
DG KERNEL – source code
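A minimal sketch of the kernel's compute pattern, reconstructed here from the CUDA-Fortran listing shown later in this presentation (array names follow that listing; the initialization values are hypothetical). The two OpenMP parallel regions run over the independent element index ie. A rough count of the inner-loop arithmetic (about 124 flops per point x 16 points x 1024 elements x 1000 iterations) is consistent with the ≈ 2 GFLOP figure above.

program dg_kernel_sketch
   implicit none
   integer, parameter :: nx = 4, npts = nx*nx, nelem = 1024, nit = 1000
   real(8), parameter :: dt = 1.0e-3_8
   real(8) :: flx(npts,nelem), fly(npts,nelem), grad(npts,nelem)
   real(8) :: der(nx,nx), delta(nx,nx), gw(nx)
   real(8) :: s1, s2
   integer :: it, ie, ii, i, j, k, l

   call random_number(flx); call random_number(fly)   ! hypothetical data
   call random_number(der); call random_number(gw)
   delta = 0.0_8
   do i = 1, nx
      delta(i,i) = 1.0_8                              ! Kronecker delta
   end do

   do it = 1, nit
      !$omp parallel do private(ii, k, l, i, j, s1, s2)
      do ie = 1, nelem                   ! region 1: gradient; elements are independent
         do ii = 1, npts
            k = modulo(ii-1, nx) + 1
            l = (ii-1)/nx + 1
            s2 = 0.0_8
            do j = 1, nx
               s1 = 0.0_8
               do i = 1, nx
                  s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                             delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
               end do
               s2 = s2 + s1*gw(j)
            end do
            grad(ii,ie) = s2
         end do
      end do
      !$omp end parallel do

      !$omp parallel do private(ii)
      do ie = 1, nelem                   ! region 2: update the flux vectors
         do ii = 1, npts
            flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
            fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
         end do
      end do
      !$omp end parallel do
   end do
   print *, 'checksum:', sum(grad)
end program dg_kernel_sketch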
CPU: conventional multi-core: 1~16+ cores / ~256-bit vector registers. Many programming languages: Fortran, C/C++, etc. Intel SandyBridge E5-2670.
Peak performance (2 sockets): 332.8 DP GFLOPS (estimated by presenter).
MIC: based on Intel Pentium cores with extensions, including wider vector registers. Many cores and wider vectors: 60+ cores / 512-bit vector registers. Limited programming languages (extensions only from Intel): C/C++, Fortran. Intel KNC (a.k.a. MIC).
Peak performance (7120): 1.208 DP TFLOPS.
GPU: many light-weight threads: ~2680+ threads (threading & vectorization). Limited programming languages (extensions): CUDA-C, CUDA-Fortran, OpenCL, OpenACC, F2C-ACC, etc.
Peak performances:
Nvidia K20x: 1.308 DP TFLOPS; Nvidia K20: 1.173 DP TFLOPS; Nvidia M2070Q: 515.2 GFLOPS.
Micro-architectures
The best performance results from CPU, GPU, and MIC.
[Chart: speedups over the best single-socket SandyBridge result: GPU 5.4x, MIC 6.6x]
MIC
Compiler options: -mmic
Environment variables: OMP_NUM_THREADS=240, KMP_AFFINITY='granularity=fine,compact'
Native mode only: no cost of memory copy between CPU and MIC.
Support from Intel: R. Sasanka
MIC evolution
[MIC evolution chart: 15.6x overall speedup from ver. 1 to the best version]
Source modification: NONE
Compiler options: -mmic -openmp -O3
MIC ver. 1
Source code:
i = 1
s1 = (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)

Compiler options: -mmic -openmp -O3
Performance considerations: complete unroll of the three nested loops; vectorized, but not efficiently enough.
MIC ver. 2
MIC ver. 3
MIC ver. 4
MIC ver. 5
CPU Evolutions with MIC evolutions
Generally, performance tuning on one micro-architecture also helps to improve performance on another micro-architecture; however, this is not always true.
GPU
Compiler options: -O3 -arch=sm_35 (the same for all versions)
"Offload mode" only. However, the time cost of the data copy between CPU and GPU is not included, for comparison with MIC native mode.
CUDA-C Evolutions
[CUDA-C evolution chart: 14.2x overall speedup from ver. 1 to the best version]
CUDA-C ver. 1
CUDA-C ver. 2
CUDA-C ver. 3
CUDA-C ver. 4
Source code:
ie = (blockIdx%x - 1)*NDIV + (threadIdx%x - 1)/(NX*NX) + 1
ii = MODULO(threadIdx%x - 1, NX*NX) + 1
IF (ie > SET_NELEM) RETURN
k = MODULO(ii-1, NX) + 1
l = (ii-1)/NX + 1
s2 = 0.0_8
DO j = 1, NX
   s1 = 0.0_8
   DO i = 1, NX
      s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                 delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
   END DO ! i loop
   s2 = s2 + s1*gw(j)
END DO ! j loop
grad(ii,ie) = s2
flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
Performance considerations: maintains the source structure of the original Fortran. Requires an understanding of the CUDA threading model, especially for debugging and performance tuning.
Supports implicit memory copy, which is convenient but can negatively impact performance if over-used. (A hedged driver sketch follows the slide title below.)
CUDA-Fortran
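As a hedged illustration of the CUDA-Fortran points above (this is not the actual DG_KERNEL driver; the kernel and names are hypothetical): device arrays carry the DEVICE attribute, ordinary assignment between host and device arrays performs the implicit memory copies mentioned above, and the kernel is launched with the chevron syntax.

module scale_kernel_mod
   use cudafor
   implicit none
contains
   attributes(global) subroutine scale_kernel(a, n, factor)
      real(8), device :: a(n)
      integer, value  :: n
      real(8), value  :: factor
      integer :: i
      i = (blockIdx%x - 1)*blockDim%x + threadIdx%x   ! global thread index
      if (i <= n) a(i) = factor*a(i)
   end subroutine scale_kernel
end module scale_kernel_mod

program launch_sketch
   use cudafor
   use scale_kernel_mod
   implicit none
   integer, parameter :: n = 1024*16
   real(8)         :: a_host(n)
   real(8), device :: a_dev(n)
   integer :: istat

   a_host = 1.0_8
   a_dev  = a_host                            ! implicit host-to-device copy
   call scale_kernel<<< (n+255)/256, 256 >>>(a_dev, n, 2.0_8)
   istat  = cudaDeviceSynchronize()
   a_host = a_dev                             ! implicit device-to-host copy
   print *, 'max value:', maxval(a_host)
end program launch_sketch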
OpenACC
F2C-ACC
One source is highly desirable: it is hard to manage versions for multiple micro-architectures and multiple programming technologies, and a performance enhancement can be applied to multiple versions simultaneously.
Conditional compilation: macros to insert and delete code for a particular technology; the user controls compilation by using compiler macros (a minimal sketch follows this slide).
Hard to get one source for CUDA-C: many scientific codes are written in Fortran, and CUDA-C has a quite different code structure and must be written in C.
Performance impact: the highest-performing tuning techniques hardly allow one source.
One source
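A minimal sketch of the conditional-compilation approach (the macro name and routine are hypothetical; the real code would key off whatever the build system defines):

! Single source file; the build selects a path with, e.g.,
!   ifort -fpp ...   or   pgfortran -Mpreprocess -D_USE_OPENACC ...
subroutine update_flux(npts, nelem, dt, grad, flx)
   implicit none
   integer, intent(in)    :: npts, nelem
   real(8), intent(in)    :: dt, grad(npts,nelem)
   real(8), intent(inout) :: flx(npts,nelem)
   integer :: ie, ii
#ifdef _USE_OPENACC
   !$acc parallel loop collapse(2) copyin(grad) copy(flx)
#else
   !$omp parallel do private(ii)
#endif
   do ie = 1, nelem
      do ii = 1, npts
         flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
      end do
   end do
end subroutine update_flux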
Faster hardware provides us with the potential for performance; however, we can exploit that potential only through better software.
Better software on accelerators generally means software that utilizes the many cores and wide vectors simultaneously and efficiently.
In practice, this massive parallelism can be achieved effectively by, among other things, 1) re-using data that have been loaded into faster memory and 2) accessing successive array elements in an aligned, unit-stride manner (a small sketch follows below).
Conclusions
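A small, hypothetical sketch of the unit-stride point above (the data-reuse point, e.g. blocking, is not shown): the first loop nest walks the column-major arrays with a large stride, the second walks unit-stride along the contiguous first dimension, so it vectorizes and reuses cache lines.

subroutine stride_sketch(n, a, b)
   implicit none
   integer, intent(in)    :: n
   real(8), intent(in)    :: a(n,n)
   real(8), intent(inout) :: b(n,n)
   integer :: i, j

   ! Strided (slower): the inner loop jumps by n elements in memory.
   do i = 1, n
      do j = 1, n
         b(i,j) = b(i,j) + a(i,j)
      end do
   end do

   ! Unit-stride (faster): the inner loop touches consecutive elements.
   do j = 1, n
      do i = 1, n
         b(i,j) = b(i,j) + a(i,j)
      end do
   end do
end subroutine stride_sketch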
Using those techniques, we have achieved a considerable amount of speed-up for DG_KERNEL.
Speed-ups compared to the best one-socket SandyBridge performance: MIC: 6.6x, GPU: 5.4x.
Speed-ups from the initial version to the best-performing version: MIC: 15.6x, GPU: 14.2x.
Our next challenge is to apply the techniques that we have learned from the kernel experiments to the real software package.
Conclusions - continued
Contacts: [email protected], [email protected]
ASAP: http://www2.cisl.ucar.edu/org/cisl/tdd/asap
CESM: http://www2.cesm.ucar.edu
HOMME: http://www.homme.ucar.edu
Extrae: http://www.bsc.es/es/computer-sciences/performance-tools/trace-generation
TAU: http://www.cs.uoregon.edu/research/tau/home.php
Thank you for your attention.