Srinath Vadlamani, Youngsung Kim, and John Dennis, ASAP-TDD-CISL
NCAR
A comparison of climate applications on accelerated and conventional architectures.
The overall picture of the acceleration effort and the different techniques should be understood. [Srinath V.] We use small investigative kernels to help teach us, and we use instrumentation tools to help us work with the larger code set.
The investigative DG_KERNEL shows what is possible if everything were simple. [Youngsung K.] DG_KERNEL helped us understand the hardware, and it helped us understand the coding practices and software instructions needed to achieve superb performance.
Presentation has two parts
ASAP personnel: Srinath Vadlamani, John Dennis, Youngsung Kim, Michael Arndt, Ben Jamroz, and Rich Loft
Active collaborators:
Intel: Michael Greenfield, Ruchira Sasanka, Sergey Egorov, Karthik Raman, and Mark Lubin
NREL: Ilene Carpenter
The Application and Scalability Performance (ASAP) team is researching modern micro-architectures for climate codes.
Climate simulations cover hundreds to thousands of simulated years.
Currently, high-resolution climate simulations run at 2~3 simulated years per day (SYPD) [~40k PEs].
GPUs and coprocessors can help to increase SYPD.
Having many collaborators mandates the use of many architectures.
We must use these architectures efficiently for a successful SYPD speed-up, which requires knowing the hardware!
Climate codes ALWAYS NEED A FASTER SYSTEM.
Conventional CPU based:
NCAR Yellowstone (Xeon: SNB) - CESM, HOMME
ORNL TITAN (AMD: Interlagos) - benchmark kernel
Xeon Phi based:
TACC Stampede - CESM
NCAR test system (SE10x changing to 7120) - HOMME
GPU based:
NCAR Caldera cluster (M2070Q) - HOMME
ORNL Titan (K20x) - HOMME
TACC Stampede (K20) - benchmark kernels only.
We have started the acceleration effort on specific platforms.
CESM is a large application, so we need to create benchmark kernels to understand the hardware. Smaller examples are easier to understand and manipulate.
The first two kernels we have created are DG_KERNEL from HOMME [detailed by Youngsung] and a standalone driver for WETDEPA_V2.
We can learn how to use accelerated hardware for climate codes by creating representative
examples.
We created DG_KERNEL knowing it could be a well-vectorized code (with help).
What if we want to start cherry picking subroutines and loops to try the learned techniques?
Instrumentation tools are available, with teams that are willing to support your efforts. Trace-based tools offer great detail. Profiling tools present summaries up front. A previous NCAR-SEA conference highlighted such tools.
Knowing what can be accelerated is half the battle.
The Extrae tracing tool was developed at the Barcelona Supercomputing Center (H. Servat, J. Labarta, J. Gimenez). The automatic performance identification process is a BSC research project. Extrae produces a time series of communication and hardware-counter events.
Paraver is the visualizer, which also performs statistical analysis. There are clustering techniques that use a folding concept plus the research identification process to create "synthetic" traces with fewer samples.
Extrae tracing can pick out problematic regions of a large code.
Clustering code regions with similarly poor computational characteristics is a good guide.
• Result of an Extrae trace of CESM on Yellowstone.
• Similar to exclusive execution time.
Extrae tracing exposed possible waste of cycles.
• Red: instruction count.
• Blue: d(INS)/dt
Paraver identified code regions.
• The trace identifies what code is active and when.
• We now examine code regions for characteristics amenable to acceleration.
Automatic Performance Identification highlighted these groups' subroutines.
Group A             Overall execution time %
  conden              2.7
  compute_usschu      3.3
  rtrnmc              1.75
Group B
  micro_gm_tend       1.36
  wetdepa_v2          2.5
Group C
  reftra_sw           1.71
  spcvmc_sw           1.21
  vrtqdr_sw           1.43
• Small number of lines of code
• Ready to be vectorized
The subroutine has sections of doubly nested loops. These loops are very long and contain branches. Compilers have trouble vectorizing loops that contain branches.
The restructuring started with breaking up the loops. We collected scalars into arrays for vector operations. We broke the very long expressions into smaller pieces (a schematic sketch is shown below).
wetdepa_v2 can be vectorized with recoding.
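A schematic Fortran sketch of the restructuring pattern (this is not the actual wetdepa_v2 source; the variable names below are hypothetical): the branch is hoisted into its own short loop, loop scalars become arrays, and the long expression is split into short, vectorizable pieces.

subroutine scavenge_sketch(ncol, cldv, solconst, precip, dt, scav)
   implicit none
   integer, intent(in)  :: ncol
   real(8), intent(in)  :: cldv(ncol), solconst(ncol), precip(ncol), dt
   real(8), intent(out) :: scav(ncol)
   real(8) :: frac(ncol), term1(ncol), term2(ncol)
   real(8), parameter :: small = 1.0e-36_8
   integer :: i

   ! 1) Hoist the branch into its own short loop and promote the scalar to an array.
   do i = 1, ncol
      frac(i) = max(cldv(i), small)
   end do

   ! 2) Break the long expression into short pieces the compiler can vectorize.
   do i = 1, ncol
      term1(i) = solconst(i)*frac(i)
      term2(i) = precip(i)*dt
      scav(i)  = term1(i) + term2(i)
   end do
end subroutine scavenge_sketch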
Hand modification of the code compares well with compiler optimization.
• Vectorizing? -vec-report=3,6
• Code optimized
• The modification touched only a small number of lines.
• -O3 fast for the original gave incorrect results.
The modified wetdepa_v2, placed back into CESM on SNB, shows better use of resources.
• 2.5% --> 0.7% of overall execution time in CESM on Yellowstone.
The CAM-SE configuration was profiled on Stampede at TACC using TAU, which provides different levels of introspection into subroutine and loop efficiency.
This process taught us more about hardware-counter metrics.
The initial investigation fits a core-count to core-count comparison.
Profilers are also useful tools for understanding code efficiency in the BIG code.
Hot spots can be associated with the largest exclusive execution time.
A long time may indicate a branchy section of code.
A long exclusive time on both devices is a good place to start looking.
Low VI marks a candidate for acceleration techniques (the KNC definition is noted below).
High VI could be misleading.
Note: The VI metric is defined differently on Sandybridge and Xeon Phi. http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
Possible speedup can be achieved with a gain in Vectorization Intensity (VI)
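For reference, on the Xeon Phi (KNC) the vectorization intensity used here is commonly computed from the VPU hardware counters (the Sandy Bridge definition differs; see the PAPI link above):

VI(KNC) = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED

with an ideal value of 8 for double precision (16 for single precision) on the 512-bit vector unit.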
CESM on KNC not competitive today.
Device (compiler options)                                       avg. time step [s]
Sandybridge, -O2                                                30.88
KNC, -O0                                                        1566.7
KNC, -O2 ("derivative_mod.F90" at -O1)                          660.48
KNC, -O2 ("derivative_mod.F90" at -O1, -align array64byte)      516.83

• FC5 ne16g37
• 16 MPI ranks/node
• 1 rank/core
• 8 nodes
• Single thread
Device          MPI ranks/device   Threads/MPI rank   Avg. 1-day dt [s]
Dual SNB        16                 1                  4.07
Dual SNB        1                  16                 6.74
Xeon Phi SE10   60                 2                  611.60
Xeon Phi SE10   1                  120                31.66
Xeon Phi SE10   1                  192                32.36
Xeon Phi SE10   2                  96                 24.07
Xeon Phi SE10   8                  24                 18.60
Xeon Phi SE10   16                 12                 19.74
Hybrid-parallelism is promising for CESM on KNC
• FCIDEAL ne16ne16
• Stampede: 8 nodes
• F03 use of allocatable derived-type components to overcome a threading issue [all -O2] (a hedged illustration follows)
• Intel compiler and IMPI
• KNC 4.6x slower
• Will get better with Xeon Phi tuning techniques
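The slides do not show the specific threading issue; purely as a hedged illustration of the F2003 feature mentioned above, allocatable components (here in a hypothetical type work_t) let each OpenMP thread allocate and own its own buffer:

program f03_alloc_component_sketch
   use omp_lib
   implicit none
   type :: work_t
      real(8), allocatable :: buf(:)     ! F2003 allocatable component
   end type work_t
   type(work_t), allocatable :: w(:)     ! one entry per thread
   integer :: tid

   allocate(w(0:omp_get_max_threads()-1))
   !$omp parallel private(tid)
   tid = omp_get_thread_num()
   allocate(w(tid)%buf(1000))            ! each thread allocates its own component
   w(tid)%buf = real(tid, 8)
   !$omp end parallel
   print *, 'thread 0 sum:', sum(w(0)%buf)
end program f03_alloc_component_sketch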
CESM is running on the TACC Stampede KNC cluster. We are becoming more familiar with the possibilities on GPUs and KNCs by using climate-code benchmark kernels.
Kernels are useful for discovering acceleration strategies and for hardware investigations. Results are promising.
We now have tracing- and profiling-tool knowledge to help identify acceleration possibilities within the large code base.
We have strategies for symmetric execution (host and coprocessor together), which is a very attractive mode of operation.
Though CESM is not competitive on a KNC cluster today, the kernel experience shows what is possible.
Part 1. Conclusion: We are hopeful to see speedup on accelerated hardware.
ASAP/TDD/CISL/NCAR
Youngsung Kim
Performance Tuning Techniques
for GPU and MIC
Introduction
Kernel-based approach
Micro-architectures
MIC performance evolutions
CUDA-C performance evolutions
CPU performance evolutions along with MIC evolutions
GPU programming: OpenACC, CUDA Fortran, and F2C-ACC
One-source consideration
Summary
Contents
What is a kernel? A small, computation-intensive part of an existing large code that represents the characteristics of its computations.
Benefits of the kernel-based approach:
Easy to manipulate and understand (CESM: >1.5M LOC).
Easy to convert to various programming technologies: CUDA-C, CUDA-Fortran, OpenACC, and F2C-ACC.
Easy to isolate issues for analysis; simplifies hardware counter analysis.
Motivation of kernel-based approach
Origin*: a kernel derived from the computational part of the gradient calculation in the Discontinuous Galerkin formulation of the shallow water equations from HOMME.
Implementation from HOMME: similar to the "dg3d_gradient_mass" function in "dg3d_core_mod.F90".
Calculates the gradient of the flux vectors and updates the flux vectors using the calculated gradient.
DG kernel
*: R. D. Nair, S. J. Thomas, and R. D. Loft: A discontinuous Galerkin global shallow water model, Monthly Weather Review, Vol. 133, pp. 876-888.
Floating-point operations: no dependency between elements; parallelism over the # of elements.
The operation count can be calculated analytically from the source code. Ex.) When nit=1000, nelem=1024, nx=4 (npts=nx*nx): ≈ 2 GFLOP.
OpenMP: two OpenMP parallel regions for the DO loops over the element index (ie); a sketch of the loops follows.
DG KERNEL – source code
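A minimal sketch of the kernel's compute pattern, reconstructed here from the CUDA-Fortran listing shown later in this presentation (array names follow that listing; the initialization values are hypothetical). The two OpenMP parallel regions run over the independent element index ie. A rough count of the inner-loop arithmetic (about 124 flops per point x 16 points x 1024 elements x 1000 iterations) is consistent with the ≈ 2 GFLOP figure above.

program dg_kernel_sketch
   implicit none
   integer, parameter :: nx = 4, npts = nx*nx, nelem = 1024, nit = 1000
   real(8), parameter :: dt = 1.0e-3_8
   real(8) :: flx(npts,nelem), fly(npts,nelem), grad(npts,nelem)
   real(8) :: der(nx,nx), delta(nx,nx), gw(nx)
   real(8) :: s1, s2
   integer :: it, ie, ii, i, j, k, l

   call random_number(flx); call random_number(fly)   ! hypothetical data
   call random_number(der); call random_number(gw)
   delta = 0.0_8
   do i = 1, nx
      delta(i,i) = 1.0_8                              ! Kronecker delta
   end do

   do it = 1, nit
      !$omp parallel do private(ii, k, l, i, j, s1, s2)
      do ie = 1, nelem                   ! region 1: gradient; elements are independent
         do ii = 1, npts
            k = modulo(ii-1, nx) + 1
            l = (ii-1)/nx + 1
            s2 = 0.0_8
            do j = 1, nx
               s1 = 0.0_8
               do i = 1, nx
                  s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                             delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
               end do
               s2 = s2 + s1*gw(j)
            end do
            grad(ii,ie) = s2
         end do
      end do
      !$omp end parallel do

      !$omp parallel do private(ii)
      do ie = 1, nelem                   ! region 2: update the flux vectors
         do ii = 1, npts
            flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
            fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
         end do
      end do
      !$omp end parallel do
   end do
   print *, 'checksum:', sum(grad)
end program dg_kernel_sketch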
CPU: conventional multi-core: 1~16+ cores / ~256-bit vector registers. Many programming languages: Fortran, C/C++, etc. Intel SandyBridge E5-2670.
Peak performance (2 sockets): 332.8 DP GFLOPS (estimated by presenter).
MIC: based on Intel Pentium cores with extensions, including wider vector registers. Many cores and wider vectors: 60+ cores / 512-bit vector registers. Limited programming languages (extensions only from Intel): C/C++, Fortran. Intel KNC (a.k.a. MIC).
Peak performance (7120): 1.208 DP TFLOPS.
GPU: many light-weight threads: ~2680+ threads (threading & vectorization). Limited programming languages (extensions): CUDA-C, CUDA-Fortran, OpenCL, OpenACC, F2C-ACC, etc.
Peak performances:
Nvidia K20x: 1.308 DP TFLOPS; Nvidia K20: 1.173 DP TFLOPS; Nvidia M2070Q: 515.2 GFLOPS.
Micro-architectures
The best performance results from CPU, GPU, and MIC.
[Chart: speedups over the best single-socket SandyBridge result: GPU 5.4x, MIC 6.6x]
MIC
Compiler options: -mmic
Environment variables: OMP_NUM_THREADS=240, KMP_AFFINITY='granularity=fine,compact'
Native mode only: no cost of memory copy between CPU and MIC.
Support from Intel: R. Sasanka
MIC evolution
[MIC evolution chart: 15.6x overall speedup from ver. 1 to the best version]
Source modification: NONE
Compiler options: -mmic -openmp -O3
MIC ver. 1
Source code:
i = 1
s1 = (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)

Compiler options: -mmic -openmp -O3
Performance considerations: complete unroll of the three nested loops; vectorized, but not efficiently enough.
MIC ver. 2
MIC ver. 3
MIC ver. 4
MIC ver. 5
CPU Evolutions with MIC evolutions
Generally, performance tuning on one micro-architecture also helps to improve performance on another micro-architecture; however, this is not always true.
GPU
Compiler options: -O3 -arch=sm_35 (the same for all versions)
"Offload mode" only. However, the time cost of the data copy between CPU and GPU is not included, for comparison with MIC native mode.
CUDA-C Evolutions
[CUDA-C evolution chart: 14.2x overall speedup from ver. 1 to the best version]
CUDA-C ver. 1
CUDA-C ver. 2
CUDA-C ver. 3
CUDA-C ver. 4
Source code:
ie = (blockIdx%x - 1)*NDIV + (threadIdx%x - 1)/(NX*NX) + 1
ii = MODULO(threadIdx%x - 1, NX*NX) + 1
IF (ie > SET_NELEM) RETURN
k = MODULO(ii-1, NX) + 1
l = (ii-1)/NX + 1
s2 = 0.0_8
DO j = 1, NX
   s1 = 0.0_8
   DO i = 1, NX
      s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                 delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
   END DO ! i loop
   s2 = s2 + s1*gw(j)
END DO ! j loop
grad(ii,ie) = s2
flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
Performance considerations: maintains the source structure of the original Fortran. Requires an understanding of the CUDA threading model, especially for debugging and performance tuning.
Supports implicit memory copy, which is convenient but can negatively impact performance if over-used. (A hedged driver sketch follows the slide title below.)
CUDA-Fortran
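As a hedged illustration of the CUDA-Fortran points above (this is not the actual DG_KERNEL driver; the kernel and names are hypothetical): device arrays carry the DEVICE attribute, ordinary assignment between host and device arrays performs the implicit memory copies mentioned above, and the kernel is launched with the chevron syntax.

module scale_kernel_mod
   use cudafor
   implicit none
contains
   attributes(global) subroutine scale_kernel(a, n, factor)
      real(8), device :: a(n)
      integer, value  :: n
      real(8), value  :: factor
      integer :: i
      i = (blockIdx%x - 1)*blockDim%x + threadIdx%x   ! global thread index
      if (i <= n) a(i) = factor*a(i)
   end subroutine scale_kernel
end module scale_kernel_mod

program launch_sketch
   use cudafor
   use scale_kernel_mod
   implicit none
   integer, parameter :: n = 1024*16
   real(8)         :: a_host(n)
   real(8), device :: a_dev(n)
   integer :: istat

   a_host = 1.0_8
   a_dev  = a_host                            ! implicit host-to-device copy
   call scale_kernel<<< (n+255)/256, 256 >>>(a_dev, n, 2.0_8)
   istat  = cudaDeviceSynchronize()
   a_host = a_dev                             ! implicit device-to-host copy
   print *, 'max value:', maxval(a_host)
end program launch_sketch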
OpenACC
F2C-ACC
One source is highly desirable: it is hard to manage versions for multiple micro-architectures and multiple programming technologies, and a performance enhancement can be applied to multiple versions simultaneously.
Conditional compilation: macros to insert and delete code for a particular technology; the user controls compilation by using compiler macros (a minimal sketch follows this slide).
Hard to get one source for CUDA-C: many scientific codes are written in Fortran, and CUDA-C has a quite different code structure and must be written in C.
Performance impact: the highest-performing tuning techniques hardly allow one source.
One source
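A minimal sketch of the conditional-compilation approach (the macro name and routine are hypothetical; the real code would key off whatever the build system defines):

! Single source file; the build selects a path with, e.g.,
!   ifort -fpp ...   or   pgfortran -Mpreprocess -D_USE_OPENACC ...
subroutine update_flux(npts, nelem, dt, grad, flx)
   implicit none
   integer, intent(in)    :: npts, nelem
   real(8), intent(in)    :: dt, grad(npts,nelem)
   real(8), intent(inout) :: flx(npts,nelem)
   integer :: ie, ii
#ifdef _USE_OPENACC
   !$acc parallel loop collapse(2) copyin(grad) copy(flx)
#else
   !$omp parallel do private(ii)
#endif
   do ie = 1, nelem
      do ii = 1, npts
         flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
      end do
   end do
end subroutine update_flux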
Faster hardware provides us with the potential for performance; however, we can exploit that potential only through better software.
Better software on accelerators generally means software that utilizes the many cores and wide vectors simultaneously and efficiently.
In practice, this massive parallelism can be achieved effectively by, among other things, 1) re-using data that have been loaded into faster memory and 2) accessing successive array elements in an aligned, unit-stride manner (a small sketch follows below).
Conclusions
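A small, hypothetical sketch of the unit-stride point above (the data-reuse point, e.g. blocking, is not shown): the first loop nest walks the column-major arrays with a large stride, the second walks unit-stride along the contiguous first dimension, so it vectorizes and reuses cache lines.

subroutine stride_sketch(n, a, b)
   implicit none
   integer, intent(in)    :: n
   real(8), intent(in)    :: a(n,n)
   real(8), intent(inout) :: b(n,n)
   integer :: i, j

   ! Strided (slower): the inner loop jumps by n elements in memory.
   do i = 1, n
      do j = 1, n
         b(i,j) = b(i,j) + a(i,j)
      end do
   end do

   ! Unit-stride (faster): the inner loop touches consecutive elements.
   do j = 1, n
      do i = 1, n
         b(i,j) = b(i,j) + a(i,j)
      end do
   end do
end subroutine stride_sketch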
Using those techniques, we have achieved a considerable amount of speed-up for DG_KERNEL.
Speed-ups compared to the best one-socket SandyBridge performance: MIC: 6.6x, GPU: 5.4x.
Speed-ups from the initial version to the best-performing version: MIC: 15.6x, GPU: 14.2x.
Our next challenge is to apply the techniques that we have learned from the kernel experiments to the real software package.
Conclusions - continued
Contacts: [email protected], [email protected]
ASAP: http://www2.cisl.ucar.edu/org/cisl/tdd/asap
CESM: http://www2.cesm.ucar.edu
HOMME: http://www.homme.ucar.edu
Extrae: http://www.bsc.es/es/computer-sciences/performance-tools/trace-generation
TAU: http://www.cs.uoregon.edu/research/tau/home.php
Thank you for your attention.