
MB3 D6.9 – Performance analysis of applications and mini-applications and benchmarking on the project test platforms

Version 1.0

Document Information

Contract Number 671697

Project Website www.montblanc-project.eu

Contractual Deadline PM39

Dissemination Level Public

Nature Report

Editor Filippo Mantovani (BSC)

Authors Fabio Banchelli, Marta Garcia, Marc Josep, Filippo Mantovani, Julian Morillo, Kilian Peiro, Guillem Ramirez, Xavier Teruel (BSC), Giacomo Valenzano, Joel Wanza Weloli (ATOS/Bull), Jose Gracia (USTUTT), Alban Lumi, Daniel Ganellari (U Graz), Patrick Schiffmann (AVL)

Reviewers Jesus Labarta (BSC), Roxana Rusitoru, Daniel Ruiz (Arm)

Keywords Benchmarks, Cavium ThunderX2, Dibona, Energy To Solution, Fluid Dynamics, HPCG, HPGMG, HPL, Infiniband, Lattice Boltzmann, Lulesh, MPI, OmpSs, OpenMP, Respiratory System, Runtime, STREAM, Solver, TensorFlow, Weather Forecast

Notices: This project has received funding from the European Union’s Horizon 2020 research and innovation

programme under grant agreement No 671697.

© Mont-Blanc 3 Consortium Partners. All rights reserved.


D6.9 - Performance analysis of applications on Dibona, Version 1.0

Contents

Executive Summary

1 Dibona cluster description
  1.1 Platform Description
  1.2 Software Stack
  1.3 User Management
  1.4 Software Management
  1.5 Platform Management
  1.6 Power Monitoring

2 Architectural Micro-benchmarks
  2.1 Memory Subsystem
  2.2 Floating Point Throughput
  2.3 Networking

3 HPC Benchmarks and Proxy-apps
  3.1 LINPACK
  3.2 HPCG
  3.3 Lulesh
  3.4 Jacobi Solver
  3.5 Eikonal Solver
  3.6 RBF Interpolation Solver

4 Scientific applications
  4.1 Alya
  4.2 OpenIFS
  4.3 TensorFlow
  4.4 LBC

5 Conclusions


Executive Summary

This document reports the activities planned in Mont-Blanc 3 WP6 under task T6.7:

T6.7 Porting, optimisation and analysis on the project test platform (m20:m39). We will port the kernels, mini-applications and full applications to the WP3 test platform and WP4 mini-clusters. Whether we will port all of them or a subset of the codes will be decided at this stage of the project. Even if not all of them are considered, the objective is to have a large enough set of codes for benchmarking and performance analysis. The set of codes will be optimised for the test platform when further optimisations beyond those applied in T6.4 are required. We will use this set of ported and optimised codes for benchmarking the test platform and mini-clusters against state-of-the-art HPC solutions and production systems. In this effort, we will use existing PRACE Tier-0 supercomputers, amongst other platforms, as the state-of-the-art baselines. We will also analyse the performance of the selected kernels, mini-applications and full applications on the test platform and mini-clusters to determine the strengths and weaknesses of the design. We will leverage the performance analysis tools provided by the partners, such as the Paraver tool from BSC, to drive this performance analysis.

This report collects the contributions of the Mont-Blanc 3 partners in the evaluation of the Dibona test platform. We structured the evaluation with a bottom-up approach, executing programs with an increasing level of complexity. Where appropriate, we complete each section with scalability and energy measurements.

• In Section 1 we briefly introduce the Dibona test platform, which has been extensively described in D3.3 Detailed specification of the medium-sized test platform.

• In Section 2 we evaluate the simplest micro-benchmarks, exposing basic architectural features such as the floating point throughput of the CPU, the structure of the memory subsystem, and the bandwidth and latency of the network.

• In Section 3 we report the results of the most relevant high-performance computing benchmarks, LINPACK and HPCG, together with the mini-app Lulesh and commonly used solvers. For HPCG we include the description and the evaluation of a shared memory version of the benchmark developed within the Mont-Blanc 3 consortium.

• In Section 4 we present the tests performed on Dibona with production scientific applications combined with the runtime optimizations introduced in D6.5 Initial report on automatic region of interest extraction and porting to OpenMP4.0-OmpSs.

Beyond the pure technical content, this document is the result of a deep and continuous collaboration effort between WP3, where the Dibona test platform has been developed, and WP6, where the applications have been ported to the Dibona test platform. This process made it possible to implement a solid deployment method, making Dibona a production-ready machine.


1 Dibona cluster description

1.1 Platform Description

Figure 1 summarizes the constituent elements of the Dibona platform. The cluster interconnect is a fat tree with a pruning factor of 1/2 at level 1 (L1) of Mellanox IB EDR-100 switches. There are 45 ThunderX2-based bi-socket compute nodes. Each compute node has 64 Armv8 cores, 32 MB of L3 cache and 256 GB of memory over 16 channels. There is a separate management network (Ethernet) for the monitoring and NFS services common to all the nodes. For more details, refer to the deliverable D3.1: High Level Description of the medium-sized test platform. To access the cluster, users connect to the login node to allocate nodes or to run their jobs using the SLURM queue and the software stack described in the next section.

Figure 1: Dibona Platform Overview

1.2 Software Stack

The software stack in Dibona is built on Red Hat Enterprise Linux 7.5 for Alternate Architectures and is based on the Bull Super Computing Suite 5 (SCS5). Software such as compilers and MPI libraries is available in different versions; the environment modules solution is used to manage these versions and to allow users to choose their preferred software. Here we describe the main components of the software packages installed.

1.2.1 Operating System

The operating system running on Dibona is the standard RHEL 7.5, except for the kernel, which has been patched by Bull. The official RHEL 7.5 kernel (4.14.0-49) has two important limitations on the Armv8 architecture, which have been remedied by the patch:

• Lack of support for Dynamic Voltage and Frequency Scaling (DVFS). The patch enables the cpufreq capability, which supports per-core frequency control using the cpupower command.


• Lack of support for perf uncore events for the ThunderX2 processor. The patch provides kernel PMU events for the DDR4 memory controller and level 3 cache.
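The per-core frequency control enabled by the DVFS patch can be exercised with the standard cpupower tool; the session below is illustrative (frequencies and core ranges are examples, not measured settings):

```
# Inspect the available governors and frequency range on core 0
cpupower -c 0 frequency-info

# Pin cores 0-63 to the 2.0 GHz nominal frequency (requires root)
cpupower -c 0-63 frequency-set -f 2.0GHz
```

With the patched kernel, the same control is exposed per job through SLURM, as described below.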

Jobs are managed via the cluster manager SLURM, version 17.02. The original Bull release was patched to support frequency scaling per job and per core on Armv8 compute nodes.

1.2.2 Interconnect

The proprietary Mellanox OpenFabrics Enterprise Distribution (MOFED) is installed on Dibona to enable the Infiniband interconnect. The Armv8 compute nodes have the latest version available (at the time of writing) for the operating system, MOFED version 4.3-3.0.2.1. This version provides initial support for the open-source "Unified Communication X" (UCX) acceleration for MPI, instead of the proprietary MXM technology available for the x86 architecture.

The head node of Dibona (x86) runs an older version of MOFED, 3.4.2, but is still able to enable communication between the compute nodes by running the Open Subnet Manager (opensm).

1.2.3 Compilers

The Dibona platform provides the following open-source compilers for the Armv8 architecture:

• GNU GCC 4.8.5: This is the default compiler from RHEL 7.5.

• GNU GCC 5.3.0

• GNU GCC 7.2.1: Installed from the Red Hat Software Collections Library (RHSCL) version 3.0.

• GNU GCC 8.2.0: The latest stable release at the time of writing.

The original GCC compiler found in RHEL 7.5 was limited. Newer versions provide explicit support for the ThunderX2 processor (-mcpu=thunderx2t99 compilation flag), including information on instruction duration, prefetch, and cache, for improved scheduling of compiled instructions.

Since 2016, Arm distributes the Arm HPC Compiler1, an LLVM-based HPC compiler bundled with a set of high-performance mathematical libraries, called Arm Performance Libraries. Banchelli et al. [1] performed an early evaluation of those tools and measured a significant difference in performance between GCC and the Arm HPC Compiler. Dibona provides the installation of these proprietary Arm tools, which also optimize for the ThunderX2 processor. Several versions are available; here are the two most recent ones:

• Arm HPC Compiler 18.4.2: Latest version based on LLVM 5.0.1.

• Arm HPC Compiler 19.0: Version based on LLVM 7.0.2.

1.2.4 Open MPI Library

The Armv8 compute nodes have Bull OpenMPI 2.0.2 installed. The Bull version adds Armv8 support to the Open Portable Access Layer (OPAL) to improve low-level implementation details such as atomic operations and cycle counts. Several flavors of this version of OpenMPI are installed, one for each compiler, all with multi-threading support.

1 https://developer.arm.com/products/software-development-tools/hpc/arm-compiler-for-hpc


This OpenMPI provides UCX support in the form of a new point-to-point management layer (PML). The UCX PML showed improved performance for some applications over the default ob1 Infiniband PML (about 2% for the OSU BW benchmark v5.3.2 between two nodes, and about 3% for HPCG 3.0 between two nodes). However, it was detected that UCX may provide decreased performance in some cases (a closed-source fluid dynamics code).

While Bull OpenMPI 2.0.2 has been continually updated with relevant bugfixes and performance improvements from newer versions, development focuses on MXM and does not officially support UCX yet. In order to better understand the reduced performance detected with the UCX PML on certain workloads, the "vanilla" OpenMPI 3.1.2 was tested on the Dibona platform. It was observed that performance was within the expected range, even with the closed-source code. The OpenMPI changelog2 indicates that several bugfixes related to UCX have been added, which have not been incorporated into Bull OpenMPI yet. For additional comparisons, and in order to test a bug involving job freezes, the latest OpenMPI 4.0.0 is also available in Dibona.

1.2.5 Performance Tools

Here we list some tools that are part of Bull SCS5; these have been specifically built for the Armv8 nodes:

• PAPI: this tool provides a common interface and methodology to work with hardware performance counters. Bull PAPI 5.5.1 has been patched to support the core events of the ThunderX2 processor. In addition to the native interface, which allows access to all 151 native core events3, 26 PAPI preset events4 were mapped.

• HPCToolkit: an integrated suite of tools for measurement and analysis of application performance. The available version 2017.06 is the result of a collaboration between Bull and the HPCToolkit developers to integrate Armv8 support into the main source code.

• Allinea Tools: Dibona includes Allinea Forge 18.1.3 and Allinea Reports 18.1.3 from Arm, to analyze the performance of parallel applications.

• Arm Performance Libraries5: this set of linear algebra libraries optimized for the Arm architecture is installed and constantly updated in Dibona. In the following sections, partners of WP6 will show their evaluation of such libraries with applications.

• BSC Tools: the BSC instrumentation package Extrae has been installed and updated in Dibona, making possible the collection of performance traces that have been studied for a deeper understanding of the architecture.

1.2.6 Mont-Blanc legacy system software stack

The BSC team made available to all users a shared folder with the optimized recompilation of most of the packages composing the legacy Mont-Blanc system software described in D4.8 – Final Report on porting runtime to 64-bit Armv8 ISA of the Mont-Blanc 2 project. This includes the OmpSs programming model leveraged for several of the results presented in this deliverable.

2 https://www.open-mpi.org/source/new.php
3 https://www.cavium.com/pdfFiles/Cavium_ThunderX2_CN99XX_PMU_Events_v1.pdf
4 http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:PAPI_presets.3
5 https://developer.arm.com/products/software-development-tools/hpc/arm-performance-libraries


1.3 User Management

The user management solution is based on the Identity Management (IdM) framework from RHEL, which enables authentication and authorization capabilities for Dibona. IdM uses Kerberos and LDAP to maintain a centralized database of users and their credentials, as well as of node identities and services. The user information includes first and last names, email, SSH/Kerberos credentials and the home directory.

When a user account is created, it is registered with IdM with a random password that expires upon first login. When the user accesses the login node via SSH, the system automatically creates a Kerberos ticket, which provides a single-sign-on session to the compute nodes. However, the user is not allowed to freely access the compute nodes via SSH; a valid SLURM allocation is required. This restriction is achieved via the PAM plugin from SLURM. As a result, requests for exclusive access to compute nodes provide a controlled environment.
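A restriction of this kind is typically a single PAM account rule on each compute node. The exact SLURM module deployed on Dibona is not stated in this report, so the line below, using the common pam_slurm_adopt module, is only illustrative:

```
# /etc/pam.d/sshd on a compute node (illustrative; module choice is an assumption)
account    required     pam_slurm_adopt.so
```

With such a rule, an SSH connection is accepted only if the connecting user owns a running job on that node.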

The compute nodes are registered with IdM as clients offering the SSH service. IdM authorizes the use of the SSH service on the compute nodes for all registered users via an HBAC rule, while access to the head node remains restricted. As an additional advantage of IdM, once a new compute node is registered with the IdM server, the accounts of all users are automatically recognized on the compute node, effectively making the management of users and nodes a scalable task.

1.4 Software Management

The operating system and software packages of the compute nodes were initially installed on the corresponding hard disks. NFS shares were used to maintain a common configuration in specific cases, such as SLURM. In order to maintain a homogeneous state of the compute nodes, text files were used to keep track of the installed packages and some configuration files. However, this basic approach had significant shortcomings:

Consistency errors – Changing the software stack, such as adding compilers and OpenMPI flavors, was prone to consistency errors. For example, originally GCC 6.2.1 was supported on the first compute nodes, with its own OpenMPI flavor. But with a processor upgrade planned, one test compute node had GCC 7.3.0 instead. When the hardware was finally upgraded for the whole cluster, this node was used to clone the disks of the newer nodes and made accessible to users. The result was that users were unable to execute OpenMPI applications due to incompatible libraries.

Experiments with the operating system – During the development of the kernel patches, newer kernels were installed on a compute node marked for testing. However, a newer kernel inadvertently upgraded the GLIBC library of the operating system. After rebooting into the original kernel and releasing the node to users, the users were unable to compile programs. Similarly, changing the kernel version required recompilation of the MOFED stack. But this was not always done after each change, resulting in failures when using Infiniband communications.

Cloning issues – Adding new compute nodes to the cluster involved cloning the hard disk of an existing node. However, this approach produced conflicts with the IdM solution previously described. The framework does not allow two nodes with the same credentials, and reinstalling IdM on the new compute node led to users complaining of errors when logging into the original node used for cloning. The solution in practice was to also reinstall IdM on the original node every time a new compute node was added.

The diskless approach was implemented to mitigate these problems. With this approach, the nodes do not require the hard disk (or EFI shell) to boot; instead they boot directly via iPXE. A DHCP/PXE server was configured on the head node to provide an iPXE script on boot request, which in turn requests the kernel and initial ramdisk via the HTTP protocol (TFTP was also possible). The root filesystem is mounted via the NFS protocol, read-only.
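The boot path just described can be sketched as an iPXE script; the server name and paths below are assumptions for illustration, not Dibona's actual configuration:

```
#!ipxe
dhcp
# Fetch kernel and initial ramdisk over HTTP from the head node (hypothetical paths),
# then mount the shared image read-only over NFS as the root filesystem.
kernel http://headnode/boot/vmlinuz initrd=initrd.img ip=dhcp root=/dev/nfs nfsroot=headnode:/export/diskless,ro
initrd http://headnode/boot/initrd.img
boot
```

Because every node fetches the same image, configuration drift between nodes disappears by construction.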

This approach has significant advantages, such as solving all three problems above. Additionally, it allowed users to propose changes to the software stack that could be easily confined to a subset of nodes. When the changes were validated, all that was needed was to mark the nodes to boot the updated image. The procedure for updating the image is as follows:

1. Create a copy of the current diskless image;

2. Share the new diskless image as a writable NFS share;

3. Create a copy of the iPXE script, pointing to the new diskless image;

4. Modify the DHCP configuration so that a chosen compute node requests the new iPXEscript on boot;

5. Update the image by modifying the filesystem of the chosen compute node;

6. Remount the diskless image as read-only.

1.5 Platform Management

During the course of the project, several issues have come up within the Dibona platform. These issues have been reported mostly by the BSC team and addressed by the Bull staff. We employed an issue tracker installed at BSC for handling the communication and allowing all users to be informed about the status of the machine. This summary of software issues aims to report noteworthy experiences and lessons learned, and is divided by topic.

1.5.1 Diskless approach

The diskless approach, while it mitigated significant shortcomings, also introduced new challenges and issues. Here we list some:

Authentication issues – the IdM solution for user management requires the identity of the compute node to remain constant. This means that it is not possible to install IdM in the diskless image. Instead, an automatic script was designed to perform a clean install of IdM on each reboot. The script was found to fail at times, leaving users unable to log in to the node. A diagnostic script was developed to run several tests of the functionalities provided by IdM. The cause of failure is believed to be a race condition between the installation of IdM and the NTP configuration; the NTP service was revised to ensure a correct system time before installing IdM.

I/O performance – The NFS filesystem significantly affected the responsiveness of the compute nodes, especially after a reboot. This was not shown to affect the performance of compute-intensive and memory-intensive benchmarks, but the nodes showed high latency when modifying files. The problem was initially mitigated by setting the NFS mount as "asynchronous", without observed downsides (the filesystem is read-only). The power monitoring solution was also shown to affect the performance of the NFS mount.
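On the server side, an asynchronous read-only export of the kind mentioned above would look roughly like the /etc/exports line below; the path and client network are assumptions:

```
# /etc/exports on the head node (illustrative)
/export/diskless   192.168.0.0/24(ro,async,no_subtree_check)
```

The async option lets the server reply before data reaches stable storage, which is safe here precisely because the clients mount the image read-only.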

Freezes during boot – The diskless approach relies on copying runtime directories (such as /var) and remounting them in memory. If the in-memory partition was too small, the boot process would freeze. It was observed that the sparse file /var/log/lastlog was seen as too large by the copy process. This file is now deleted from any newly-created diskless image.


1.5.2 SLURM

Requested node configuration is not available – Detected after upgrading a diskless image. The problem was related to file permissions of the munge software, required by SLURM. Now the cp -a command is used when copying a diskless image, to maintain the state of files. A similar problem was also detected when installing a new compute node, where the topology needed to be updated (/etc/slurm/topology.conf).

Compute nodes appear as drained in the queue – SLURM decides that nodes do not meet the expected hardware configuration. This happened mainly because of an incorrect boot process, which misconfigures the number of cores detected by the operating system. The workaround is to continually monitor the cluster for nodes with wrong core counts.

Nodes appear offline after reboot – This intermittent problem was found to be related to a race condition between the munge and zabbix packages, resulting in bad file permissions. The systemd services were modified to ensure that zabbix-agent runs after munge. A similar race condition between NTP and munge was detected and fixed.
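An ordering constraint of this kind is usually expressed as a systemd drop-in; a minimal sketch (the actual unit and file names used on Dibona may differ):

```ini
# /etc/systemd/system/zabbix-agent.service.d/order.conf (illustrative)
[Unit]
After=munge.service
Wants=munge.service
```

After=- only orders startup; Wants= additionally pulls munge in, so the agent never starts on a node where munge was never launched.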

1.5.3 OpenMPI

Failure to communicate between processes – this happens primarily due to interconnect failure. The diskless approach ensures that the nodes all have a homogeneous configuration, so this problem was reduced to the hardware issues of each compute node, and of the head node. Since the head node runs the Open Subnet Manager, any Infiniband error in the head node translates into errors across the whole cluster. A script was developed to quickly verify the state of the interconnect on all active compute nodes.

Low performance – There have been reports of low performance when using OpenMPI. One issue was related to using srun instead of mpirun: it was detected that the OpenMPI configuration lacked the PMI interface with SLURM. Another report was related to enabling UCX with a closed-source application, even though benchmarks showed improvement when switching to UCX. It was found that the installed Bull OpenMPI version lacked several fixes regarding UCX. An updated, open-source OpenMPI (3.1.2) was installed as an alternative.

Jobs terminating or freezing – There have been some reports of job failures, particularly when testing the scalability of benchmarks over many compute nodes. These reports are still under examination. The current evidence suggests many different root causes: a) execution errors inducing unhandled kernel faults; b) low memory limits set by SLURM; c) extraneous processes running in the compute node. The memory limits were revised, and a cleaning script was put in place to ensure that no processes from previous jobs linger into the next job. OpenMPI jobs freezing when using certain combinations of nodes and allocations is still a matter of investigation, but Bull OpenMPI 2.0.2 was exempt from this error.

1.6 Power Monitoring

Dibona's power drain is monitored by HDEEM [2]. The High Definition Energy Efficiency Monitoring (HDEEM) library is a software interface used to measure the power consumption of HPC clusters with bullx blades. Measurements are made via the BMC (Baseboard Management Controller) and an FPGA (Field-Programmable Gate Array) located on each compute node motherboard. The power monitoring devices installed on a Dibona node allow us to monitor the power drain of:

• the global board;

• the two ThunderX2 CPUs;


• each DDR memory domain (four DIMM domains);

• the mezzanine board used for the InfiniBand interconnection.

In order to access the power devices installed on the Dibona board, a modification to the BIOS and FPGA was required. Figure 2 shows the procedure used for the energy accounting measurements gathered for this deliverable. We use the GPIO signals to easily restart and stop data collection in-band. There is also an SSH script that allows users to retrieve their measurements saved in the FPGA through the BMC.

Figure 2: Power monitor procedure for gathering energy measurements in Dibona.

As background, the FPGA constantly monitors the energy drains listed above with a sampling rate of 1 ms for the global board and 10 ms for the other sensors. The job scheduler (SLURM) running in the cluster offers the possibility of running a task prolog/epilog script right before a job starts and right after it ends on each allocated node. Dibona also offers another method to read actual power values at the CPU level, provided by the ThunderX2's power management unit, called M3. The requests are performed using an IPMI implementation to the correct address space.
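Given fixed-rate power samples such as those the FPGA records, energy-to-solution is simply the time integral of power. The sketch below is generic (it is not the HDEEM API; the function name and sample values are illustrative):

```python
def energy_joules(samples_w, dt_s):
    """Trapezoidal integration of equally spaced power samples (W) into energy (J)."""
    return sum((a + b) / 2.0 * dt_s for a, b in zip(samples_w, samples_w[1:]))

# A node drawing a constant 100 W for 1 s, sampled every 1 ms (the board's rate):
samples = [100.0] * 1001
print(round(energy_joules(samples, 0.001), 3))  # -> 100.0 (joules)
```

The prolog/epilog scripts mentioned above delimit which samples belong to a given job, so this integral can be computed per job.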


2 Architectural Micro-benchmarks

This section analyzes the architectural features of the Dibona platform. We start by characterizing the memory subsystem in Section 2.1 with the micro-benchmarks STREAM (to measure peak memory bandwidth) and lmbench (to measure memory read latency); Section 2.2 studies the performance of scalar and vector floating point operations on Dibona with a custom micro-benchmark called FPU µKernel. Section 2.3 analyses point-to-point and collective MPI communications between nodes using the Intel MPI Benchmarks and the OSU Micro-Benchmarks.

2.1 Memory Subsystem

This section analyzes the memory subsystem of a Dibona node. We evaluate memory bandwidth using the STREAM [3] and lmbench [4] benchmarks. Our study also includes a side-by-side comparison with MareNostrum4 (MN4)6. Table 1 shows a brief overview of the memory subsystem of each machine.

Table 1: Memory subsystem overview for Dibona and MareNostrum4

Machine        CPU                        #cores  #sockets  L1 size  L2 size  L3 size  Memory     #channels  Peak bandwidth
Dibona         Cavium ThunderX2           32      2         32 kB    256 kB   32 MB    DDR4-2666  8          341.33 GB/s
MareNostrum4   Intel Xeon Platinum 8160   24      2         64 kB    256 kB   33 MB    DDR4-3200  6          307.20 GB/s

STREAM is a simple synthetic benchmark to measure sustainable memory bandwidth. The program is structured in four distinct computational kernels: Copy, Scale, Add and Triad. The reference version of the benchmark includes a parallel version using OpenMP.

The kernels iterate through data arrays of double precision floating point elements (8 B) with a size fixed at compile time. It is required that the size of each array be at least four times the size of the sum of all the last-level caches or ten million elements, whichever is larger:

E ≥ max(4 · S / 8, 10 000 000)

where E is the number of elements of each array (referred to as STREAM_ARRAY_SIZE in the code) and S is the size of the last-level cache in bytes. Table 2 shows the minimum value for E on Dibona and MareNostrum4. These are the values used to perform our tests. We also show the compiler flavor and version we used on each machine, as well as the compilation flags.

Table 2: E minimum values for Dibona and MareNostrum4

Machine        S         E         Compiler       Optimization flags
Dibona         33554432  16777216  armclang 18.3  -Ofast -mcpu=thunderx2t99
MareNostrum4   34603008  17301504  icc 17.0.4     -xCORE-AVX512 -mtune=skylake
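As a sanity check, the minimum array sizes in Table 2 follow directly from the sizing rule above; a short sketch using the cache sizes S reported in the table:

```python
def min_stream_elements(llc_bytes, elem_bytes=8):
    """Minimum STREAM_ARRAY_SIZE: max(4 * S / element_size, 10^7) elements."""
    return max(4 * llc_bytes // elem_bytes, 10_000_000)

print(min_stream_elements(33_554_432))   # Dibona       -> 16777216
print(min_stream_elements(34_603_008))   # MareNostrum4 -> 17301504
```

On both machines the cache term dominates the ten-million-element floor, so E is exactly half the cache size expressed in elements times four.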

We run the benchmark by fixing the problem size to the minimum valid value of E foreach platform as reported in Table 2 and increasnig gradually the number of OpenMP threads.We report in this document the results of Triad as a representative scientific kernel since therest of the kernels have a similar behavior when increasing the number of treads. Threadsare pinned to cores by using OMP_PROC_BIND=true. We tried two different thread bindingpolicies: i) Contiguous, where threads are bound as close as possible; and ii) Interleaved, where

6https://www.bsc.es/marenostrum/marenostrum


threads are bound as far apart as possible. The first policy starts by filling one socket before moving to the second one, while the second policy distributes the threads evenly across both sockets. Figure 3 shows the achieved bandwidth with the contiguous policy and Figure 4 shows the interleaved policy. The x-axis represents the number of OpenMP threads and the y-axis indicates the maximum bandwidth achieved throughout 200 executions of the kernel. The figures also include two horizontal lines representing the theoretical peak bandwidth of each machine. Please note that the DDR technology is different: Dibona uses DDR4-2666, with a theoretical peak of 21.33 GB/s per channel; and MareNostrum4 uses DDR4-3200, with a theoretical peak of 25.60 GB/s per channel. In the case of Dibona, we run the benchmark with its base frequency of 2.0 GHz and with the Turbo mode enabled, which boosts the frequency up to 2.5 GHz.

Figure 3: STREAM Triad. Best bandwidth achieved over number of OpenMP threads in one socket. Thread binding: Contiguous

Figure 4: STREAM Triad. Best bandwidth achieved over number of OpenMP threads in two sockets. Thread binding: Interleaved

The contiguous binding saturates the memory bandwidth at around 100 GB/s in Dibona starting at 8 OpenMP threads. With interleaved binding, Dibona reaches around 218.40 GB/s


(64% of the peak) in the Triad kernel when running with 64 OpenMP threads and Turbo, while MareNostrum4 gets 171.89 GB/s (56% of the peak) with 48 OpenMP threads. The achieved bandwidth saturates at 16 OpenMP threads on both machines.

lmbench is a collection of small and portable benchmarks designed by Larry McVoy and Carl Staelin. A full run of the benchmark suite analyzes the machine's hardware capabilities: integer and floating point performance, memory latency and bandwidth, disk access performance, network latency, and OS noise.

In this section, we report the results obtained with the "Memory Load Latency" test within lmbench. This test measures the memory hierarchy's load latency at different cache levels. The benchmark takes a stride S to work with. It then traverses arrays of varying sizes using different strides. The elements of the array create a ring of pointers which go backwards, as shown in Figure 5. The traversal of the array is done by dereferencing these pointers. Results are reported as nanoseconds per load operation. We run this test using arrays from 512 bytes to

Figure 5: lmbench Memory Load Latency test ring of pointers. Each element of the array points backwards S elements.

128 MiB and strides S ∈ {16, 64, 1024}. We kept the default clock frequency of 2 GHz during the test. The results are shown in Figure 6. The x-axis represents the array size in KiB while the y-axis represents the memory latency in ns. Each colored line represents a run with a specific stride S. The plot has three very distinct phases with stable read latency. Upon closer inspection, we see that each phase corresponds to a memory hierarchy level. We added vertical bars to the figure to indicate the different phases.

Figure 6: lmbench Memory Load Latency. The benchmark exposes the read latency of each level in the memory hierarchy. The CPU is running at the default 2 GHz.

The read latency of the L1 cache seems to be uniform across all values of S. For the L2 cache, when using a stride of S = 16, the latency is around 3 ns, while the other strides go up to 5.5 ns. When accessing L3, the read latency is much greater for S = 1024. Going even further


up, the measured read latency does not show the same patterns as before. Since the memory hierarchy is shared between cores from the L3 cache onwards, it is difficult to obtain stable measurements.

We can convert the measured read latency from nanoseconds to CPU cycles. This gives a rough estimate of the number of cycles it takes to access data at each cache level. Table 3 shows the lower and upper bounds of this measurement for the L1, L2 and L3 caches. Measurements for L3 have a high deviation, which may be an effect of the cache being shared between cores.

Table 3: Read latency (in CPU cycles) of L1, L2 and L3

Cache level  Min  Max  Avg   Std   Std/Avg

L1           4    4    4.0   0.0   0.0 %
L2           6    11   9.7   2.0   20.0 %
L3           10   75   41.4  22.7  54.83 %

2.2 Floating Point Throughput

We designed a micro-kernel to measure the peak floating point throughput of the machine. We call this code FPU µKernel; it contains exclusively fused-multiply-accumulate assembly instructions with no data dependencies between them. The kernel has four versions, distinguishing between i) scalar and vector instructions; and ii) single and double precision. The Dibona nodes are based on the Armv8 architecture with the NEON vector extension. The base ISA has floating point instructions which accept single and double precision registers as operands; in this case, the kernel uses the instruction FMADD. The NEON extension is a vector ISA that allows for 128 bit vector registers (two double precision data elements per register); here the kernel uses the NEON vector instruction FMLA. In contrast, MareNostrum4 nodes are based on the x86 architecture with the AVX512 vector extension. Although the x86 ISA has floating point instructions that run on the FPU, it is recommended to use the more recent SIMD instructions, so the compiler will automatically translate a * b + c to VFMADD132SS or VFMADD132SD for single and double precision, respectively. We implemented the SIMD version of the kernel to use the AVX512 instructions VFMADD132PS for single precision and VFMADD132PD for double precision. This means that the scalar version of the code on the x86 architecture will use vector instructions with the same behavior as scalar floating point instructions. The theoretical peak of the vector unit can be computed as the product of i) the vector size in elements (e.g., four single precision elements in NEON); ii) the number of instructions issued per cycle; iii) the frequency of the processor; and iv) the number of floating point operations performed by the instruction (e.g., fused-multiply-accumulate does two floating point operations). Table 4 lists these

Table 4: Theoretical peak performance of one NEON and one AVX512 vector unit in Dibonaand MareNostrum4

Instruction   Precision  Vec. Length  Issue  Freq. [GHz]  Flop/Inst  Peak [GFlop/s]

FMLA          Single     4            2      2.00         2          32.00
FMLA          Double     2            2      2.00         2          16.00
VFMADD132PS   Single     16           2      2.10         2          134.40
VFMADD132PD   Double     8            2      2.10         2          67.20

parameters and the theoretical peak for Dibona and MareNostrum4 in both single and double precision vector operations.


Figure 7 shows the results obtained on both machines. The FPU scalar unit of Dibona peaks at 7.99 GFlop/s for both single and double precision. The NEON vector unit reaches 31.95 GFlop/s and 15.97 GFlop/s for single and double precision operations, respectively, compared to the 131.67 GFlop/s and 65.79 GFlop/s of the AVX512 unit. The vector length of AVX512 is four times that of NEON (128 bits), which accounts for the difference in sustained performance.

Figure 7: Sustained performance in one core of the four versions of the FPU µKernel

2.3 Networking

The Intel MPI Benchmarks [5] (IMB) are a collection of performance measurements for MPI operations. The benchmarks are divided into multiple components which measure MPI-1, MPI-2 and MPI-3 functionality. When executing each of these components it is possible to select which MPI functions to test. The OSU Micro-Benchmarks [6] (OSU) are a similar collection of performance measurements authored by The Ohio State University. This section evaluates the network performance of Dibona, which uses a Mellanox Infiniband EDR (IB) interconnect. We also include a comparison with MareNostrum4, which uses an Intel Omni-Path (OPA) interconnect, and we report the PingPong results for Hazel Hen, a Cray XC40 cluster located at HLRS that uses Cray's Aries interconnect.

Our tests include: i) point-to-point communications, IMB's PingPong and OSU's osu_bw; ii) parallel transfer benchmarks, IMB's MultiPingPong; and iii) collective primitives, Allgather and Alltoall for both IMB and OSU.

Figure 8: MPI process configuration in Intel MPI benchmarks. Left: PingPong, Center: Multi-PingPong, Right: Collectives


Single transfer benchmarks involve two active communicating processes. We used the IMB PingPong, which calls MPI_Send and MPI_Recv; and the OSU osu_bw benchmark, which calls MPI_Isend and MPI_Irecv. The test consists of allocating two processes in different nodes of the machine, as shown in the left side of Figure 8. This is a synthetic setup, since only two processes communicate at once: it avoids network contention and lets us observe whether the communication approaches peak bandwidth and latency. Figure 9 plots the achieved throughput (y-axis) over the message size of the communication (x-axis). All points represent the average value of 100 repetitions of the communication. Throughput is computed by the benchmark itself following the definition T = l/t, where T is the throughput in MB/s; l is the length of the message in bytes; and t is the time it takes the message to get from one process to the other, measured in microseconds.

All three networks approach the theoretical peak as the message size increases. OPA consistently achieves better bandwidth than IB with message sizes over 256 KiB. The difference in bandwidth is also very noticeable at message sizes around 4 KiB and 8 KiB, where OPA almost doubles IB. In the case of Aries, the measured bandwidth drops drastically at a message size of 4 MiB.

Looking at the best case for each network, Intel's OPA and Mellanox IB achieve almost peak bandwidth (∼95%) while Cray's Aries reaches ∼88% of the theoretical peak. Please note that MareNostrum4 and Hazel Hen are production clusters with heavy network traffic. This translates to a large variance in our measurements and a possible drop in sustained bandwidth.

The measured bandwidth of OSU on Dibona seems to stall around 8 and 16 KiB but then goes up to 10 GB/s for larger message sizes. This behavior is consistent across multiple pairs of nodes and between executions. We do not have an explanation for these measurements at the time of writing this document.

Figure 9: IMB - Bandwidth between two processes in different nodes.

Figure 10: OSU - Bandwidth between two processes in different nodes.

Figure 11 shows the previous plot zoomed in to message sizes between 1 B and 1 KiB. In this picture we find that OPA consistently achieves twice as much bandwidth as IB with IMB. In the case of OSU, which uses non-blocking primitives, the achieved bandwidth seems to be the same on both machines. Aries consistently achieves ∼60% to ∼70% of the bandwidth measured on Dibona and MareNostrum4.

We repeated the PingPong tests for multiple pairs of nodes. Figure 13 and Figure 14 show three heat-maps where the x-axis represents the first node in the pair, the y-axis represents the second node in the pair, and each cell is color-coded to represent the measured bandwidth. We present the measurements for message sizes of 8 bytes, 1 KiB and 4 KiB. There is a recurring pattern along the diagonal where pairs of nodes have higher bandwidth. This is due to the network topology: the pairs of nodes with higher bandwidth are connected to the same switch (L1), as described in Section 1. Pairs of nodes that are physically farther apart achieve 10 %


Figure 11: IMB - Bandwidth zoomed in (small message sizes).

Figure 12: OSU - Bandwidth zoomed in (small message sizes).

less bandwidth than pairs of nodes that are close. We call the slower pairs of nodes Weak links.

Figure 13: Weak links in the Dibona network. Message sizes: 8 Bytes (left), 1 KiB (right)

To better understand how network saturation affects MPI communications, we tested pairs of processes using a separate benchmark. For each pair p_i of processes running on different nodes, we measure t_i, the time for exchanging x bytes with MPI_Sendrecv calls. We can then compute the effective bandwidth observed by each of the pairs as B_{p_i} = 2x/t_i. In Figure 15 we plot the cumulative bandwidth Σ_{i=1}^{64} B_{p_i} for different values of the message size x.

This study tells us how many MPI processes are needed to saturate the physical links among nodes. For completeness, we analyzed message sizes of 8 B, 64 B, 1 KiB and 8 KiB, representative of 1, 8, 128 and 1024 doubles. We also included bigger message sizes such as 64 KiB, 512 KiB and 4 MiB.

Collective communications are also part of a typical HPC workload. In theory, all processes communicate with each other at the same time, forming a complete graph K_p, where p is the number of processes. The actual MPI implementation may reduce the number of messages to reduce network traffic. We run the collective benchmarks Alltoall and Allgather with multiple processes across multiple nodes. The setup is shown in the right part of Figure 8. Figure 16 and Figure 17 plot the bare timings for a fixed message length of 1 KiB and an increasing number of communicating processes. Each figure reports the timings for the IMB and OSU implementations of the benchmark. The x-axis shows the number of communicating MPI


Figure 14: Weak links in the Dibona network. Message size: 4 KiB.

Figure 15: Bandwidth saturation increasing the number of MPI processes with different messagesizes.


Figure 16: Allgather bare timings. Figure 17: Alltoall bare timings.

processes and the y-axis shows the measured time to complete the communication (in milliseconds). The Allgather primitive timings are below 1 millisecond. The performance is mostly equal for both benchmarks and machines. There is high variability when running the OSU implementation on MareNostrum4 with 240 MPI processes.

The Alltoall timings range from around 1 ms up to 6 ms. There is no apparent difference in performance between IMB and OSU. The time to complete the communication increases much faster on Dibona than on MareNostrum4: with 384 MPI processes, Dibona and MareNostrum4 time at 5.6 ms and 2.6 ms respectively, meaning Dibona is 2.15 times slower than MareNostrum4. Since we are using two different MPI implementations on the two machines (Intel MPI 2017.4 in MareNostrum4 and OpenMPI 2.0.2.14 in Dibona) and two different networks, we can reasonably attribute the difference to these factors. Still, the relevant message in Figure 17 is the divergent trend of the blue line (lower latency on MareNostrum4) and the red line (higher latency on Dibona).


3 HPC Benchmarks and Proxy-apps

3.1 LINPACK

LINPACK [7] is a benchmark that solves a uniformly random system of linear equations and reports the time and floating point execution rate using a standard formula for the operation count. It is widely used to benchmark supercomputers, and it is the main benchmark of the TOP500.

Table 5 shows the different configurations chosen for building LINPACK, used for all tests reported in the rest of the section. For all tests we used version 2.0.2.11 of OpenMPI and the HPL implementation of LINPACK, version 2.2. Since the benchmark relies on linear algebra functions, we tested it with OpenBLAS v0.3.3 and an Arm-optimized implementation, the Arm Performance Libraries v18.4.1.

Table 5: HPL building options: compilers and flags.

Compiler          Version  Optimization flags

Arm HPC Compiler  18.4.1   -Ofast -mcpu=native -ffp-contract=fast
GCC7              7.2.1    -O3 -mcpu=thunderx2t99 -ffp-contract=fast
GCC8              8.2.0    -O3 -mcpu=thunderx2t99 -ffp-contract=fast

3.1.1 Parameter set

The performance of LINPACK is strongly dependent on the parameters used to configure the problem size and its partition among compute nodes. Several recipes can be found for optimizing LINPACK performance; among others, we followed the one described in [8].

The most important parameter to tune is the size of the matrix, N, corresponding to the number of elements (double precision floating point numbers) stored in the matrix. We modeled N as follows:

N ≈ ⌈√(s · M / D)⌉

where M is the total amount of memory of each compute node (in bytes); s is the fraction of memory we want to use (expressed as a number between 0 and 1); and D is the size of a double in bytes. Several values of s have been tested. To maximize performance, it is recommended to use as much memory as possible (so s ≈ 1); however, as the number of nodes increases, the overall benchmark requires more memory, so if s is too large it can exceed the memory available on the machine. Experimental tests show that s = 0.75 allows us to reach decent performance while scaling well, so this value has been used for our analysis.

The same empirical approach has been used with the NB parameter, the block size used for the data distribution and computational granularity: after testing several values of NB, we noted that NB = 256 maximizes the performance on Dibona.

Since we are interested in weak scaling, as the number of nodes increases, N also grows in order to keep the workload per node constant. Each time the number of nodes doubles, N is multiplied by √2, as seen in Table 6.

7 http://www.netlib.org/benchmark/hpl/
8 https://github.com/xianyi/OpenBLAS
9 https://developer.arm.com/products/software-development-tools/hpc/arm-performance-libraries


Table 6: HPL matrix size (N) and block size (NB) parameters for the tested configurations.

N       NB   Nodes  Cores  MB/core

160256  256  1      64     3062
226816  256  2      128    3066
320512  256  4      256    3062
453120  256  8      512    3059

Two other important parameters are P and Q, where n = P · Q and n is the total number of processes employed to run LINPACK. Different combinations of these two parameters can lead to different results, so several combinations have been used throughout this preliminary test. In addition, when using multiple nodes, we also found that different compilers and libraries may perform better with different values of P and Q.

Figure 18 shows the performance obtained with 4 Dibona nodes (256 cores) when testing three different compilers (GCC7, GCC8, and Arm HPC Compiler) and two linear algebra libraries (OpenBLAS and the Arm Performance Libraries) with different values for P and Q ([P = 8, Q = 32]; [P = 16, Q = 16]; and [P = 32, Q = 8]). For all the results shown in the rest of this section, all these possible combinations of compilers and linear algebra libraries have been tested, along with different combinations of the P and Q parameters (adjusted in each case to the number of running processes). Plots and tables will always show the best of these combinations.

Figure 18: HPL performance combining three compilers (GCC7, GCC8 and Arm HPC Compiler), two linear algebra libraries (OpenBLAS and the Arm Performance Libraries) and different combinations of P and Q.

Other HPL parameters were tested thoroughly with one and two nodes. The best configuration for two nodes was kept for all the tests in the results section. A complete list of the HPL parameters can be seen in Table 7.


Table 7: HPL execution options: rest of algorithm parameters.

PMAP  PFACT  NBMIN  NDIV  RFACT  BCAST   DEPTH  SWAP  L1          U           Equilibration

Row   Right  4      4     Right  1ringM  1      mix   transposed  transposed  yes

3.1.2 Results

RPeak is the theoretical maximum performance of the system, computed as RPeak = C · F · f_op, where C is the total number of cores, F is the core clock frequency and f_op is the number of floating point operations executed per clock cycle. The latter has been measured and reported in Section 2.2. RMax is the highest LINPACK performance achieved. In Table 8 we can see the ratio RMax/RPeak, also known as %RPeak.

Table 8: HPL average performance results.

Cores  RMax [GFLOPS]  RPeak [GFLOPS]  %RPeak  Compiler + Library

64     859.36         1024            83.92   GCC7 + Arm PL
128    1676.00        2048            81.84   GCC7 + OpenBLAS
256    3349.20        4096            81.77   GCC8 + OpenBLAS
512    6603.60        8192            80.61   GCC8 + OpenBLAS
1024   12280.00       16384           74.95   GCC7 + OpenBLAS

We observe that the %RPeak decreases from 83%, obtained when running with 64 cores, to 75%, when running with 1024 cores. We also notice that the performance differences between compiler and library configurations widen as we increase the number of cores.

Figure 19 shows a performance plot in GFLOPS obtained with different compilers and libraries.

Figure 19: Performance of HPL using different compilers and libraries.


3.2 HPCG

The High Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads; its popularity and acceptance within the HPC community have therefore increased to the point that it is now an official TOP500 benchmark.

3.2.1 Benchmark characterization

The HPCG benchmark solves a symmetric sparse linear system, mimicking the typical behaviour of a finite element code on a 3D semi-regular grid. As explained in [9], the problem uses an additive Schwarz preconditioner for a first domain decomposition, while each sub-domain is preconditioned using a symmetric Gauss-Seidel sweep that is local to the sub-domain. In its current implementation, the most important numerical kernel for HPCG performance-wise is the 27-point stencil. The benchmark is split into the following phases:

Problem setup and validation - During this phase, all the data structures needed to execute the benchmark are constructed: memory is allocated to store the sparse matrix that represents the 3-dimensional grid, and the right-hand side vector and the result vector are also allocated. Processor and process topologies are generated and stored. Once everything is set up, a validation step checks that the generated matrix fulfills the requirements of the benchmark.

Compute reference SpMV and SymGS - The reference algebraic kernels, sparse matrix-vector multiplication (SpMV) and symmetric Gauss-Seidel (SymGS), are executed and benchmarked. The timings collected in this phase are used afterwards to check the possible improvements obtained by user optimizations.

Compute a reference conjugate gradient (CG) - The reference version of the CG algorithm is executed for a fixed number of iterations (50) and the reduction residual is stored. The same reduction residual needs to be reached by the optimized version implemented by the user. This implies that the optimized version can use a different CG algorithm requiring a different number of iterations to reach the residual. Performance of the optimized version is in fact recorded independently of the number of iterations whenever (and only if) the reduction residual is reached.

Set up optimized CG run and validation - The optimized CG is then executed once and its time noted as tCG. This timing is used to compute how many times the CG will be executed during the actual benchmarking phase. The number of repetitions is computed as N = (rt/tCG) + 1, where N is the number of repetitions and rt is the execution time provided as an input parameter by the user. A validation is performed to make sure that, after all the modifications performed by the user to improve the performance, the matrix still fulfills the requirements of the benchmark.

Optimized CG run - Finally, the HPCG is executed and its performance is recorded. This performance is reported at the end of the execution.

Figure 20 illustrates how the recursive V-cycle multi-grid algorithm is implemented in the HPCG benchmark. The idea is to define different levels of the same matrix and move to a lower/coarser level by applying the combination of smoother (ComputeSYMGS), residual computation (ComputeSPMV) and restriction of the residual (ComputeRestriction). The HPCG benchmark implements four such coarsening steps; an interpolation process (also called refinement) then restores the finer grid, going up to the finest level in the V-cycle (shown on the right side of Figure 20). Each refinement step performs two kernels: ComputeProlongation followed by smoothing with ComputeSYMGS.


Figure 20: Representation of the V-Cycle multi-grid algorithm implemented in HPCG.

Figure 21: Breakdown of function calls during the conjugate gradient iteration (left) and percentage of their execution time (right) when running the serial reference HPCG benchmark.

3.2.2 Profiling

To understand the behaviour of the HPCG reference benchmark, we used the profiling information reported by the application itself. Our first study is based on executions with 8 OpenMP threads of the original HPCG [10] with grid size nx=ny=nz=128, which corresponds to a problem size of ∼1.3 GBytes of memory per process.

Figure 21 (left) shows how calls from different compute kernels are related to each other. Figure 21 (right) reports the percentage of the total execution time spent in each kernel.

In our test, the multi-grid kernel (ComputeMG) consumes ∼85% of the execution time per iteration. It calls multiple functions, including ComputeMG itself, recursively. The remaining ∼15% of the execution time, represented in Figure 21, is consumed by ComputeSPMV, ComputeDotProduct and ComputeWAXPBY.

3.2.3 OpenMP parallelization

Figure 22 shows a timeline of the different OpenMP parallel regions executed by HPCG, using 8 threads each. On the x-axis we plot the execution time, while on the y-axis we display thread progression. Different colors along the timeline indicate different parallel compute regions, while light blue regions correspond to sequential parts of the code (i.e., parts where only one thread is actually executing code). We observe that the OpenMP threads are idle most of the time; in our analysis we measured that the code runs on a single thread for ∼86% of the whole execution time. This is because the symmetric Gauss-Seidel, the most time consuming kernel of the benchmark, is a serial algorithm; therefore no OpenMP parallelization of it has been implemented in the reference version of HPCG.


Figure 22: Timeline of the OpenMP parallel regions during the execution of ComputeMG.

3.2.4 Suggested improvements

As highlighted in the profiling study, the scalability of the reference OpenMP version is mostly limited by the serial preconditioner implementation. Parallelizing the preconditioning process usually implies a relaxation of the symmetric Gauss-Seidel algorithm, which can introduce a penalty in terms of the iterations needed to converge (i.e., a trade-off between fewer but slower and more but faster iterations). By applying the techniques introduced in this section, we measured between 5% and 40% more iterations depending on the method and parameters chosen.

The final benchmark performance, measured as the number of floating point operations per second, takes into account the execution time spent on the extra iterations required for convergence due to the relaxation of the algorithm. However, only the floating point operations of the first 50 iterations are considered when computing the performance score.

Parallelizing the preconditioner also required the modification of various data structures at run-time and the management of auxiliary variables. The portion of the execution time spent in such operations is included when computing the final benchmark performance. From a productivity point of view, the code changes are confined to 4 files and increase the total number of lines by ∼2%.

3.2.4.1 Multi-color reordering

We start with the parallelization of the symmetric Gauss-Seidel kernel using OpenMP, as described in [11]. The core idea behind this technique is to color each node in the graph in such a way that nodes with the same color do not share edges. In the preconditioning process each node needs values only from its first nearest neighbors in the graph. As a result, nodes with the same color can be processed in parallel. Figure 23 shows an example of how to color a 2-dimensional graph with 8 neighbors per node. The initial graph is colored and then reordered, such that nodes with the same color have consecutive indexes. In our implementation, we used 8 colors, which is the minimal number of colors needed for a 27-point 3-dimensional stencil [12].

The algorithm we chose for coloring is the greedy coloring described in [13]. The computational cost of the coloring process with this approach introduces an overhead of ∼1%, negligible compared to the overall execution time. We are aware that this method may badly affect cache data reuse: as parallel accesses are performed on nodes that are stored non-contiguously, the coloring process harms locality.

3.2.4.2 Multi-block color reordering

The idea behind multi-block color reordering is similar to multi-color reordering, but instead of coloring single nodes we color sets of nodes, called blocks. We used the method introduced in [11] and applied in [14], but with a different block geometry and different colors. Once the size and geometry of the blocks are defined, we treat every block as a single node with a connectivity list that can be handled with the greedy coloring algorithm mentioned in Section 3.2.4.1. After applying this method, blocks with the same color can be computed in parallel, while nodes within the same block must be computed serially. Figure 24 illustrates how the blocking is performed and how the blocks are colored in a 2D graph with regular nearest-neighbor connectivity.

Figure 23: Example of coloring and reordering of a 2D regular graph with 16 nodes. Nodes not sharing edges can be colored with the same color and processed in parallel.

Figure 24: A. Example of multi-block color reordering of a regular 2D graph. B. 3D lattice with BS=1, C=2. C. 3D lattice with BS=2, C=2.

This method improves the convergence of the algorithm compared to the simple coloring, as shown in [15]; therefore fewer iterations are needed in comparison with multi-color reordering. It also improves cache locality with respect to multi-color reordering, since each thread accesses consecutive rows of the matrix.

For the block partitioning, we decided to group nodes into blocks following a 2D slice topology. BS is the block size, i.e., the thickness of the slice, while C represents the number of colors. For exploratory purposes these parameters are chosen at compilation time in this study, but by adapting the code they could also be set dynamically. The idea is to be able to optimize the number of colors and the size of the blocks so that they best map to the underlying hardware and therefore provide higher performance.

In Figure 24 we show an example using BS=1 (B) and BS=2 (C). In both cases the number of colors is C=2. Varying these parameters has different effects on performance: increasing the number of colors is beneficial for convergence, but it decreases parallelism and imposes constraints on the geometry, because nz mod C should be 0 to keep all threads busy all the time. Increasing the block size can bring both cache benefits and fewer OpenMP synchronizations, but it may lead to inefficiencies during the recursion steps, especially when the lattice size becomes small.
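The resulting sweep structure can be sketched as follows (a simplified single-level illustration with our own data layout, not the benchmark source; rows of one color carry no mutual dependencies, so each color range becomes one parallel loop):

```cpp
#include <vector>

// CRS matrix with rows reordered so that all rows of one color are
// consecutive; color_start[c] .. color_start[c+1] delimit color c.
struct CRS {
    std::vector<int> row_ptr, col;
    std::vector<double> val, diag;
};

// One forward Gauss-Seidel sweep: colors are visited sequentially,
// while the rows inside one color are updated in parallel.
void colored_gs_sweep(const CRS& A, const std::vector<int>& color_start,
                      const std::vector<double>& b, std::vector<double>& x) {
    const int ncolors = static_cast<int>(color_start.size()) - 1;
    for (int c = 0; c < ncolors; ++c) {
        // All rows in [color_start[c], color_start[c+1]) are independent.
        #pragma omp parallel for
        for (int i = color_start[c]; i < color_start[c + 1]; ++i) {
            double sum = b[i];
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                sum -= A.val[k] * x[A.col[k]];   // gathers hurt locality
            sum += A.diag[i] * x[i];             // re-add the diagonal term
            x[i] = sum / A.diag[i];
        }
    }
}
```

The indirect accesses `x[A.col[k]]` are exactly the non-contiguous loads discussed later in the IPC analysis: coloring buys parallelism at the cost of cache locality.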


3.2.4.3 Use of Arm-optimized compiler and libraries

HPCG is also a good candidate for the Arm proprietary tools (including the Arm HPC Compiler and the Arm Performance Libraries, see Section 1.2). Most of the benchmark kernels could be mapped directly to BLAS or LAPACK functions, for example the dot-product and the axpy operations.

All dot products performed within the symmetric Gauss-Seidel kernel are also good candidates for being replaced by library calls. The vectors used in such operations are composed of the values of the neighbor nodes of a given row and of the solution vector entries for each of those neighbors; therefore the number of elements of the vectors is at most 27. Unfortunately, we verified that the explicit use of math libraries on such small vectors brings no performance benefit on Dibona unless extra work is done to expose additional parallelism.
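For illustration, the operation in question boils down to an indirect dot product over at most 27 elements, roughly of this shape (our own simplified rendering, not the benchmark source):

```cpp
#include <vector>

// Dot product between the nonzeros of one matrix row and the gathered
// entries of the solution vector. nnz_row is at most 27 for the
// 27-point stencil, which is too short for a library ddot call to
// amortize its call overhead.
double row_dot(const double* row_vals, const int* row_cols, int nnz_row,
               const std::vector<double>& x) {
    double sum = 0.0;
    for (int k = 0; k < nnz_row; ++k)
        sum += row_vals[k] * x[row_cols[k]];   // gather from x via indices
    return sum;
}
```

The gather through `row_cols` also means the right-hand operand is never contiguous, which is why a plain BLAS dot product (which expects strided arrays) does not apply directly.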

3.2.4.4 Vectorization

The symmetric Gauss-Seidel kernel does not take advantage of automatic SIMD vectorization. Since the floating-point operations performed within the kernel consist mainly of dot products where one of the arrays is not contiguous in memory, the compiler does not generate SIMD instructions.

We introduced naive SIMD versions of ComputeSPMV and ComputeSYMGS using NEON [16, 17], without significant performance improvements. The same versions can be used as a starting point to test different SIMD extensions such as the Arm Scalable Vector Extension (SVE) [18], which includes gather-load and scatter-store instructions, useful for stencil workloads such as the one presented by the symmetric Gauss-Seidel algorithm.

3.2.5 Results and evaluation

The code development performance evaluation shown in Figure 25 and the scalability study presented in Section 3.2.5.1 have been done on the Dibona cluster.

Our study is based on a local problem size of nx=192, ny=384 and nz=512, which corresponds to ∼26.6 GB of memory per process. Different values of nx, ny and nz are used to keep the global problem size constant at this memory value for the MPI experiments (see Table 9), since HPCG automatically performs weak scaling when increasing the number of MPI processes. The runtime parameter is always set to 600 (i.e., --rt=600).

In all OpenMP experiments, we set the environment variable OMP_PROC_BIND to ensure that the runtime binds threads to cores, resulting in better data locality.

The software stack used for our tests included GCC 8.2.0, Arm HPC Compiler 19.0, Arm Performance Libraries 19.0.0, and OpenMPI 3.1.2.

For GCC we used the following flags: -O3 -mcpu=native -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=0 -fopenmp -std=c++11 -funroll-loops.

For the Arm HPC Compiler we used the following flags: -O3 -mcpu=native -ffast-math -fvectorize -fopenmp -std=c++11 -ffp-contract=fast.

Figure 25 shows the performance obtained for different OpenMP thread configurations of the multi-coloring version (green lines) described in Section 3.2.4.1 and the multi-block coloring version (blue lines) introduced in Section 3.2.4.2. For comparison we also report the performance of the MPI-only implementation on one node (black dot) and of the OpenMP reference version of the HPCG benchmark (red lines), which, as expected, does not show any scalability.


Table 9: HPCG running parameters (nx, ny and nz values are chosen to keep the total memory constant at 26.6 GB).

Configuration  MPI processes  OpenMP threads  Cores   nx   ny   nz

OpenMP             -       1       1   192  384  512
OpenMP             -       2       2   192  384  512
OpenMP             -       4       4   192  384  512
OpenMP             -       8       8   192  384  512
OpenMP             -      16      16   192  384  512
OpenMP             -      32      32   192  384  512
OpenMP             -      64      64   192  384  512

MPI                1       -       1   192  384  512
MPI                2       -       2   192  192  512
MPI                4       -       4   192  192  256
MPI                8       -       8    96  192  256
MPI               16       -      16    96   96  256
MPI               32       -      32    96   96  128
MPI               64       -      64    48   96  128
MPI              128       -     128    48   48  128
MPI              256       -     256    48   48   64
MPI              512       -     512    24   48   64
MPI             1024       -    1024    24   24   64
MPI             2048       -    2048    24   24   32

Hybrid             2      32      64   192  192  512
Hybrid             4      32     128    96  192  512
Hybrid             8      32     256    96  192  256
Hybrid            16      32     512    96   96  256
Hybrid            32      32    1024    48   96  256
Hybrid            64      32    2048    48   48  256
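The invariant behind Table 9 (the local dimensions shrink so that the global problem size stays fixed at 192 × 384 × 512 points) can be checked in a few lines (a hypothetical helper, not part of HPCG):

```cpp
#include <cstdint>

// Global problem size in grid points: nx * ny * nz per process times
// the number of MPI processes. Every row of Table 9 keeps this product
// equal to the single-process local size, 192 * 384 * 512 = 37,748,736.
std::int64_t global_points(std::int64_t procs, std::int64_t nx,
                           std::int64_t ny, std::int64_t nz) {
    return procs * nx * ny * nz;
}
```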


Figure 25: Performance of the HPCG multi-block colored version, multi-colored version and MPI-only reference version.

The performance improvement is significant, even if the scalability is far from optimal. The main issue with the multi-color reordering approach is the resulting low instructions per clock cycle (IPC). This is the result of two concurrent factors:

1. Due to the computation order imposed by the reordering of the matrix, almost all memory accesses in the symmetric Gauss-Seidel kernel target non-contiguous addresses. This negatively impacts the cache miss ratio.

2. Due to the higher number of threads accessing memory, the L2 cache is more stressed, both by a higher number of requests and by heavier coherency traffic.

To increase parallelism, the multi-coloring implementation needs to reorder the computations. However, it then requires 20% to 38% more iterations to achieve convergence, depending on the geometry of the input set, negatively affecting the final overall performance.

Looking at compiler effects, we observed that the Arm HPC Compiler generates, on average, a faster binary than GCC.

The reason for the bad scalability from 16 to 32 cores is the saturation of the memory bandwidth, which can be seen in Figure 3 in Section 2.1.

3.2.5.1 Multi-node scalability

Figure 26 shows the strong scaling behaviour of HPCG when increasing the number of MPI processes beyond one node of the machine, up to 32 nodes (2048 MPI processes).

As can be seen, HPCG presents fairly good scalability. The poor scalability between 32 and 64 processes comes again from the saturation of the memory bandwidth when filling one socket, as explained previously. Note that in this experiment processes are bound interleaved between sockets, in contrast with the previous section where, for a fair comparison with OpenMP, MPI processes were bound contiguously (i.e., filling one socket with 32 processes before moving to the other).

Figure 26: Multi-node strong scaling of HPCG on Dibona.

Starting at 64 cores (i.e., one node), Figure 26 also presents the performance of a hybrid configuration (1 MPI process per socket with 32 OpenMP threads each, see Table 9) and the energy to solution needed for both configurations. Table 10 includes these results together with the energy efficiency of both the MPI-only and hybrid configurations. The OpenMP-only configuration filling one node is also included for comparison purposes.

Table 10: HPCG energy efficiency.

MPI processes  OpenMP threads  Cores  Time [s]  Energy [J]   GFLOPS  GFLOPS/W

   -    64    64  577.15   341945.66    5.36  0.01

  64     -    64  623.75   226512.28   21.49  0.06
 128     -   128  616.13   406894.26   40.69  0.06
 256     -   256  611.14   769199.34   74.98  0.06
 512     -   512  610.97  1555857.25  185.54  0.07
1024     -  1024  610.49  2823893.47  295.45  0.06
2048     -  2048  631.46  7345205.05  559.44  0.05

   2    32    64  616.48   191913.15   15.56  0.05
   4    32   128  606.00   408494.93   34.83  0.05
   8    32   256  598.89   692139.47   60.72  0.05
  16    32   512  597.51  1601743.47  123.39  0.05
  32    32  1024  603.28   713796.41  225.09  0.19
  64    32  2048  604.84  2238493.63  296.88  0.08


3.3 Lulesh

LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) is a highly simplified application, hard-coded to solve only a simple Sedov blast problem with analytic answers, but it represents the numerical algorithms, data motion, and programming style typical of scientific C- or C++-based applications.

The code is built on the concept of an unstructured hex mesh, and it approximates the hydrodynamics equations discretely by partitioning the spatial problem domain into a collection of volumetric elements defined by the mesh. A node on the mesh is a point where mesh lines intersect.

The default test case for LULESH is a regular Cartesian mesh, but this is for simplicity only; it is important to retain the unstructured data structures, as they are representative of what a more complex geometry will require. The benchmark limits the number of MPI processes, as it must be the cube of a natural number (1, 8, 27, ...), and it also adjusts the number of elements to generate weak scaling according to the number of MPI processes. The total number of elements equals the cube of the problem size (defined by the -s parameter) multiplied by the number of MPI processes.

Table 11: LULESH problem sizes used to transform weak- to strong-scaling.

MPI Processes  Problem Size

    1  150
    8   75
   27   50
   64   (37.5) 37
  125   30
  216   25
  343   (21.428) 21

In order to generate strong scaling results in our performance analysis tests, we have created a correspondence between the number of MPI processes and the required size (see Table 11) which produces an approximately constant number of elements (3,375,000) for all the executions. When the resulting problem size was not an integer (numbers in brackets), the value was rounded to the closest integer.
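The correspondence in Table 11 follows from requiring s³ · p ≈ 150³ = 3,375,000, i.e. s = 150 / p^(1/3). A quick check (a hypothetical helper, not part of LULESH):

```cpp
#include <cmath>

// Problem size s for p MPI processes such that the total element count
// s^3 * p stays close to 150^3 = 3,375,000 elements (before rounding
// to the closest integer, as done for Table 11).
double lulesh_size(int p) {
    return 150.0 / std::cbrt(static_cast<double>(p));
}
```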

3.3.1 Previous contributions

Throughout the Mont-Blanc 3 project, the benchmark has been the object of different studies and transformations. Initially, we made a preliminary study using the automatic transformations that the Mercurium compiler was able to apply to the original parallel for loops, converting them into their equivalent use of tasks. Next, we started a set of more significant changes in which these initial transformations were adjusted in an attempt to exploit specific OmpSs constructs. Deliverable D6.5 Initial report on automatic region of interest extraction and porting to OpenMP4.0-OmpSs [19] discusses these techniques and transformations:

• Dependences in the taskloop construct;

• Reductions with indirections on large arrays;

• Memory allocation issues; and


• Increasing granularity by nesting tasks.

The main idea in all of them was to reduce the overhead of task creation and to control the critical path of the application so that the lookahead of its execution could be increased. The primary objective of these adjustments was to improve the runtime scheduler's decisions: firstly, advancing the execution of the critical path in those phases where OpenMP showed a certain imbalance; secondly, improving data locality by following the producer/consumer dependencies.

Earlier publications have also offered a performance analysis of OmpSs comparing the results with its OpenMP counterpart (D6.4 [20] and D6.5 [19]). We observed similar behavior in some cases and highlighted some differences in others. The initial conclusion was that in OmpSs it was more complicated to manage the number of tasks and their granularities. The lack of tasks produced insufficient work to feed all the available cores. When increasing the problem size, we solved the degree-of-parallelism problem, but we also increased the imbalance of certain phases of the application, which has a considerable impact on the overall performance of the benchmark.

In this last phase of the study, we present an additional parallelization technique which was already used but not documented in previous deliverables: the precomputation of connectivity (see the next section). We also show two new performance analyses on the new platforms currently available to the project: MareNostrum4 and Dibona. The objective of these comparisons is to complete the overview of how the parallelization of this benchmark behaves on all these platforms and to offer more information from which to extract more robust conclusions.

3.3.2 Precompute connectivity

Precomputing connectivity allows determining the list of neighbor nodes for a given node in the mesh. We can use this information to establish incompatibilities when executing operations on the structure that involve directly connected nodes. Before operations can be started for a specific node, all scheduled operations on all nodes in its neighbor list must have finished. We also need to prevent the simultaneous execution of operations over the same node in order to avoid race conditions. The dependence control system uses this information to schedule tasks.

The precompute-connectivity technique can be used in two functions that are part of the critical section of LULESH (i.e., the most time-consuming phases in LULESH, according to deliverable D6.2 Report on regions of interest as mini application candidates [21]). These functions are IntegrateStressForElems and CalcFBHourglassForceForElems.

Listing 1 shows the inspector code. Firstly, we group nodes in chunks to avoid having a huge number of connections between nodes. This grouping is parametrized at compilation time by setting the NNB value (Number of Nodes per Block). Secondly, we compute an intermediate vector named will_touch that determines whether any node of a given group interacts with any node of another group. To compute the will_touch vector, we iterate over the groups of nodes, taking one of them as the current group and checking interactions with the rest. This phase counts the number of interactions between each pair of groups. Once we have computed the interactions for the current group of nodes, we populate its list of neighbors (e.g., neighs_FBH) and we also set the number of neighbors (e.g., n_neigh_FBH) for the current group. Each group of nodes has its own list of neighbors with a variable number of elements.

The list of neighbors is computed only the first time each of these target functions is executed, and this information remains constant for the rest of the benchmark execution.

Listing 1: LULESH OmpSs precompute connectivity before taskloop.

    for (Index_t kk=0; kk<(numElem+EBS); kk+=EBS) {
        int e_block = kk/EBS;

        // Declare and initialize will_touch array for this element
        unsigned int will_touch[NNB+1];
        for (int i=0; i<(NNB+1); i++) will_touch[i]=0;

        // Compute will_touch array
        for (Index_t k=kk; (k < kk+EBS) && (k < numElem); k++) {
            const Index_t* const elemToNode = domain.nodelist(k);
            for (Index_t lnode=0; lnode<8; ++lnode) {
                Index_t gnode = elemToNode[lnode];
                will_touch[gnode/NBS]++;
            }
        }

        // Fill neighbours array using positions of will_touch_deps array
        int count=0;
        for (int i=0; i<(NNB+1); i++) {
            if (will_touch[i] > 0) {
                neighs_FBH[e_block][0][count] = &will_touch_deps[i];
                count++;
            }
        }
        n_neigh_FBH[e_block][0] = count;
    }

As the number of neighbors may vary between groups of nodes, and since we also need to use this information to prevent the simultaneous execution of two tasks computing nodes with interactions between them, we require the use of multi-dependencies. Computational tasks will include the following clause:

commutative(*(neighs_FBH[i2/EBS][j]), j=0:n_neigh_FBH[i2/EBS]-1)

Tasks using the commutative clause over the same symbol (i.e., neighbor) are guaranteed to execute with mutual exclusion. The expression i2/EBS determines the group the resulting tasks are computing (tasks are created through a taskloop directive annotating a loop that uses i2 as control variable, with a grain size equal to EBS, Elements per Block Size). The iterator j used in the commutative clause is declared to take integer values from 0 to the number of neighbors minus one (i.e., n_neigh_FBH[i2/EBS]-1).

This precompute-connectivity mechanism can be used jointly with the overlap of consecutive loops described in previous deliverables. The loops, in this case, are combined using the dependence information with respect to the precomputed neighbor array.

3.3.3 Performance analysis on MareNostrum4

In this section, we provide an application performance analysis using MareNostrum4. The main goal is to compare the aforementioned OmpSs version against the original OpenMP one and also to extend the picture we already have of this benchmark from previous deliverables.

The application has been executed with a number of processes ranging from 1 to 343, using the steps and problem sizes described in Table 11. The number of threads per process has been defined in slots of 1, 6, 12 or 24 cores (fitting the 48 cores available in each MareNostrum4 node, i.e., 48, 8, 4 or 2 processes per node respectively).

Both codes (OpenMP and OmpSs) have been tested on MareNostrum4, using the following tools and options:


• Compilers: IMPI 2017.4, gcc 7.1.0, mcxx 2.1.0

• Using O3 optimization

• OmpSs runtime: nanox 0.15a

• Running using the following command:

--cpu-freq=High --cpu_bind=cores

• Executed using 1, 6, 12 and 24 threads per process (when the configuration was available), due to the 48-core nodes.

Figures 27 and 28 show the performance obtained on MareNostrum4 for the OpenMP and OmpSs versions respectively. Each line shows the evolution of performance for a given number of processes and programming model when increasing the number of threads. The performance is measured in z/s (an ad-hoc LULESH metric that measures throughput).

Figure 27: OpenMP execution – Efficiency of Lulesh on MareNostrum4. Baseline performance 1215.29 z/s with one MPI process and one thread.

Figure 28: OmpSs execution – Efficiency of Lulesh on MareNostrum4. Baseline performance 1215.29 z/s with one MPI process and one thread.

On small core counts there is little (or no) difference between the two programming models, but when we increase the number of cores (either by increasing the number of processes or the number of threads per process), the gap widens. For example, if we take the 8-MPI-process lines (dark blue and light green) as a reference, we can see that when using 1 and 6 threads per process (first two dots), OpenMP and OmpSs behave without significant difference. At the third dot (12 threads per process) there is the first significant difference, with OpenMP performing significantly better (around 13%). The gap gets even broader at the last dot (24 threads per process), where OmpSs performance decreases while OpenMP keeps increasing, causing a difference of over 100%. In the case of 27 MPI processes (orange and purple lines), OmpSs falls behind OpenMP at the second dot (6 threads per process) by 9%, and it is at the third dot (12 threads per process) where OmpSs performance drops.

As we increase the number of processes (and therefore of total cores), this behavior becomes more pronounced: OmpSs starts losing performance at lower numbers of threads, and the gap between OmpSs and OpenMP widens.

3.3.4 Performance analysis on Dibona

In this section, we provide a performance analysis of the application using Dibona. As in the previous case, the main goal is to compare the behavior of the OmpSs version against the original OpenMP one. With these results, we can complete the performance profile by including one more platform. A secondary goal is to provide a deeper analysis of the Dibona system. All applications in this deliverable have targeted this system to better understand how Dibona behaves when executing different types of applications.

The results gathered in this section have been obtained with a number of processes ranging from 1 to 343, using the steps and problem sizes described in Table 11 (the same used when executing on MareNostrum4). The number of threads per process has been defined in slots of 1, 8, 16 and 32 cores (fitting the 64 cores available in each Dibona node, i.e., 64, 8, 4 and 2 processes per node respectively).

OpenMP and OmpSs have been tested on Dibona using the following tools and options:

• Compilers: OMPI 2.0.2.14, gcc 7.2.1, mcxx 2.3.0

• Using O3 optimization

• OmpSs runtime: nanox 0.15a

• Running using the following command:

mpirun

– Map and binding options:

--bind-to core --map-by slot:PE=#threads

– Number of processes and additional flags:

-n #procs -mca pml ^ucx

• Executed using 1, 8, 16 and 32 threads per process (when the configuration was available), due to the 64-core nodes.

Figures 29 and 30 show the performance obtained on Dibona for the OpenMP and OmpSs versions respectively.

Figure 29: OpenMP execution – Efficiency of Lulesh on Dibona. Baseline performance 594.29 z/s with one MPI process and one thread.

Figure 30: OmpSs execution – Efficiency of Lulesh on Dibona. Baseline performance 594.29 z/s with one MPI process and one thread.

The trend is similar to the one shown for MareNostrum4, but with a more pronounced difference between the two versions. In this case, irrespective of the number of processes, OmpSs performance starts falling earlier than OpenMP's, and the difference between versions gets broader. The main reason for this behavior seems to be related to the task granularity, as Dibona appears more sensitive to the task creation overhead. A higher core count also affects the degree of parallelism of some regions of code, where the task decomposition has been done with respect to a block size defined at compile time, rather than using the number of threads.

In this case we also include the energy-to-solution metric for each benchmark execution (see Table 12 for OpenMP and Table 13 for OmpSs). We derive the energy efficiency by combining this metric with the corresponding throughput and execution time (i.e., (z/s)/W).


Table 12: LULESH energy efficiency for OpenMP.

MPI processes  Threads  Cores  Time [s]  Energy [J]        z/s  (z/s)/W

  8    8    64  708.66  203062.27   25628.61   89.44
  8   16   128  402.67  217706.42   45718.60   84.56
  8   32   256  281.74  216278.26   64395.72   83.88
 27    8   216  226.45  234855.82   72410.53   69.81
 27   16   432  153.48  263021.70  119264.46   69.59
 27   32   864  157.16  444437.67  119005.33   42.08
 64    1    64  577.69  167229.05   29271.09  101.12
 64    8   512  133.57  241824.35  132968.71   73.44
 64   16  1024  107.46  397184.45  160656.04   43.44
 64   32  2048  127.55  906831.77  138235.04   19.44
125    1   125  300.53  172570.33   58619.65  102.08
125    8  1000  135.61  487665.67  137384.07   38.20
125   16  2000   99.98  704577.99  160912.15   22.83
216    1   216  205.63  221660.14   93192.97   86.45
216    8  1728  124.38  739414.13  166975.36   28.09
343    1   343  118.34  194468.78  141923.03   86.36

Table 13: LULESH energy efficiency for OmpSs.

MPI processes  Threads  Cores  Time [s]  Energy [J]       z/s  (z/s)/W

  8    8    64   741.96   216845.08  24230.82   82.91
  8   16   128   572.66   305941.40  31866.40   59.65
 27    1    27  1185.56   105494.75  12322.89  138.48
 27    8   216   325.82   357806.58  55709.17   50.72
 27   16   432   412.35   720692.90  43902.40   25.12
 64    1    64   642.78   190096.73  36154.47  122.25
 64    8   512   246.83   508267.48  70845.78   34.40
 64   16  1024   374.86  1208529.98  45843.99   14.21
125    1   125   420.84   232163.16  42418.58   76.89
125    8  1000   229.72   575475.74  78943.02   31.51
125   16  2000   378.79  1232516.45  47945.19   14.73
216    1   216   305.02   326293.56  58026.45   54.24
216    8  1728   235.12  1599031.36  76958.19   11.31
343    1   343   210.42   331825.21  81249.69   51.52


3.3.5 Conclusions

The benchmark has shown the importance of a correct task granularity policy. This involves task decomposition using dynamic schedulers that allow adjusting the task granularity based on the total size of the problem and the number of hardware resources available at runtime. In some cases, it would be convenient to apply a conditional parallelization (i.e., to define a cut-off strategy) and create parallelism only when it is beneficial.

It has also been verified that external elements can help solve load imbalance problems. Executions carried out using the Dynamic Load Balance (DLB) tool in previous phases of the project showed a benefit close to 10%.

The correct use of allocation and deallocation operations has also been shown to be a critical element in the behavior of the application. The imbalance introduced by buffer allocation increases when this service is called concurrently. As this operation is on the critical path (it prevents the execution of the following tasks that depend on the buffer), the impact on the overall execution time is considerable.

The benchmark can help study the trade-off between the degree of parallelism and task granularity. A better review of the techniques that can be applied in user space (i.e., by the programmer), together with new runtime services allowing the decomposition of work on demand, would help alleviate the overhead introduced by the tasking model by means of a better use of system resources.


3.4 Jacobi Solver

The Jacobi solver is a micro-application for exploring new hardware and/or software environments; its basic algorithms are reused in the Algebraic Multigrid and the CARP code later on.

−∇ᵀ(λ(x) ∇u(x)) = f(x)   ∀x ∈ Ω
+ boundary conditions on ∂Ω

We consider this potential problem, where mainly Dirichlet boundary conditions are used for the experiments, although mixed boundary conditions with Neumann and Robin parts are also possible (pure Neumann boundary conditions are avoided). The computational domain Ω ⊂ R² is discretized with triangular elements, and linear shape functions are used for the local approximation in each finite element. This finite element approach results in a linear system of equations with a symmetric and positive definite n × n matrix K.

Ku = f

This system matrix K is sparse and unstructured, and it is therefore stored in the Compressed Row Storage (CRS) format. Other sparse storage formats can be realized easily.

uᵏ⁺¹ = uᵏ + ωD⁻¹(f − K·uᵏ),   where r := f − K·uᵏ

The Jacobi iteration is performed with u⁰ = 0 until the relative error 〈rᵏ, wᵏ〉/〈r⁰, w⁰〉, where w := D⁻¹r, is smaller than a given tolerance ε. The solver uses the scalar parameter ω = 1; D contains the diagonal entries of K, and the inverse of this diagonal matrix is pre-computed before the Jacobi iteration.

The basic ingredients of the Jacobi iteration above are the (sparse) matrix-vector product and the scalar product 〈r, w〉. Having implemented and parallelized these operations opens the opportunity for implementing more sophisticated iterative solvers such as Krylov/Lanczos methods and multigrid methods, as we do in the Algebraic Multigrid and the CARP code.
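A minimal serial rendering of these ingredients (CRS matrix-vector product, scalar product, and the damped Jacobi update with ω = 1) might look as follows; this is an illustrative sketch under our own naming, not the solver's actual source:

```cpp
#include <vector>
#include <cstddef>

struct CRSMatrix {                     // Compressed Row Storage
    std::vector<std::size_t> row_ptr;  // size n+1
    std::vector<std::size_t> col;      // column index per nonzero
    std::vector<double> val;           // value per nonzero
};

// y = K * u  (sparse matrix-vector product)
std::vector<double> spmv(const CRSMatrix& K, const std::vector<double>& u) {
    std::vector<double> y(K.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < K.row_ptr.size(); ++i)
        for (std::size_t k = K.row_ptr[i]; k < K.row_ptr[i + 1]; ++k)
            y[i] += K.val[k] * u[K.col[k]];
    return y;
}

// scalar product <a, b>
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Jacobi iteration u^{k+1} = u^k + omega * D^{-1} (f - K u^k), starting
// from u = 0, until <r, w> / <r0, w0> < eps with w := D^{-1} r.
// inv_diag holds the pre-computed inverse of the diagonal of K.
std::vector<double> jacobi(const CRSMatrix& K,
                           const std::vector<double>& inv_diag,
                           const std::vector<double>& f,
                           double omega = 1.0, double eps = 1e-10) {
    std::vector<double> u(f.size(), 0.0);
    double err0 = -1.0;
    for (int it = 0; it < 10000; ++it) {
        std::vector<double> r = spmv(K, u);
        for (std::size_t i = 0; i < r.size(); ++i) r[i] = f[i] - r[i];
        std::vector<double> w(r.size());
        for (std::size_t i = 0; i < r.size(); ++i) w[i] = inv_diag[i] * r[i];
        double err = dot(r, w);
        if (err0 < 0.0) err0 = err;        // <r0, w0> from the first residual
        if (err <= eps * err0) break;
        for (std::size_t i = 0; i < u.size(); ++i) u[i] += omega * w[i];
    }
    return u;
}
```

In the actual micro-application these loops are distributed with MPI and threaded with OpenMP; the sketch only fixes the arithmetic structure that those parallel versions must reproduce.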

3.4.1 Performance on Dibona

Benchmarking with the Jacobi solver is very useful because it acts as a sandbox for AMG as well as for CARP. We benchmark a version of Jacobi that uses non-blocking MPI point-to-point communication for data exchange. We have executed the Jacobi solver on two different platforms:

• Dibona ThunderX2 node, Armv8, 64 cores.

• Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz

We have selected two different mesh sizes:

• 1024 × 1024 for 1000 iterations.

• 4048 × 4048 for 1000 iterations.

Different pinning strategies are tested:


• No binding.

• OMP_PROC_BIND=close.

• OMP_PROC_BIND=spread.

• OMP_PROC_BIND=true.

For both the Jacobi and Eikonal solvers, using OMP_PROC_BIND=close improved the performance by around 3-5% on the Dibona cluster.

Single node performance measurements are taken on both ThunderX2 and Skylake processors (see Table 14 and Table 15).

Table 14: Jacobi Solver on Dibona cluster single node performance - Mesh size 1000 × 1000 for 1000 iterations.

# Cores   Time [s]   SpeedUp   Efficiency
      1      29.92      1.00         1.00
      4       7.31      4.00         1.00
     16       1.99     14.70         0.91
     64       0.70     41.81         0.65

Table 15: Jacobi Solver on Skylake single node performance - Mesh size 1000 × 1000 for 1000 iterations.

# Cores   Time [s]   SpeedUp   Efficiency
      1      11.20      1.00         1.00
      4       3.29      3.40         0.85
     16       1.01     11.08         0.69

The efficiency graph can be seen in Figure 31. The first executions, up to 32 cores, use pure MPI, with only the first core of the first socket of each node, and we observe linear efficiency. As soon as we jump to 64 cores we introduce 2 OpenMP threads per MPI process and see a drop of 12-15% in performance. Beyond this point we use 1 MPI process per socket with 4 OpenMP threads, and we see a superlinear speedup for the larger problem size. After this point we use 8 OpenMP threads; here we are clearly limited by the memory bandwidth per node. We see only a small drop in performance (1-2%) when we keep the same number of OpenMP threads per MPI process. The best performance for the Jacobi solver is reached with 1 MPI process per socket and 4 OpenMP threads per process; after that point performance drops because we hit the memory bandwidth limit.

3.5 Eikonal Solver

Simulations in cardiac electrophysiology use the bidomain equations describing the intracellular and the extracellular electrical potential. Their difference, the transmembrane potential, is responsible for the excitation of the heart, and its steepest gradients form an excitation wavefront


Figure 31: Efficiency of Jacobi solver for two different mesh sizes using a total of 32 compute nodes on the Dibona test platform.

propagating in time. This arrival time ϕ(x) of the wavefront at some point x ∈ Ω can be approximated by the simpler Eikonal equation with given heterogeneous, anisotropic velocity information M(x). The Eikonal equation is better suited to the inverse problem because it significantly reduces the computational intensity of the bidomain equations; solving the inverse problem with the full bidomain equations is practically infeasible.

Several massively parallel algorithms for GPU computing are developed, including domain decomposition concepts for tracking the moving wavefronts in sub-domains and over the sub-domain boundaries. Furthermore, a low memory footprint OpenMP and CUDA implementation of the solver is introduced. This reduces the number of arithmetic operations and enables improved memory access schemes. The CUDA implementation of the parallel algorithm reduces the run time further such that interactive simulations on portable devices are possible. For a coarse model, the solver is transferred onto a tablet computer and other handheld devices for clinical use. To overcome the memory limitations when using larger meshes, the CUDA solver is extended to multiple accelerator cards using MPI.

3.5.1 Domain Decomposition Parallel Eikonal Solver

For large scale problems, the task-based parallel model runs into difficulties: there might not be enough (shared) memory on a single host or GPU, the computing power of a single compute unit may not be sufficient, or the parallel efficiency may not be satisfactory. In all cases, a distributed memory model is needed. Hence, a coarser decomposition of the algorithm is required, namely a domain decomposition approach.

The domain is statically partitioned into a number of non-overlapping sub-domains Ω_i (see Figure 32), each of them assigned to a single processor. Synchronization and communication of


Figure 32: Domain decomposition. Computational domain Ω and sub-domains Ω_i.

the processors is reduced to a minimum. In our case, a single processor i can efficiently solve the Eikonal equation on Ω_i as long as its boundary data on ∂Ω_i is correct. However, this data may belong to the outer boundary or to other processors. Hence, inter-processor communication is needed.

3.5.2 Numerical Tests and Performance Analysis

We present the results of the numerical tests in single precision, performed on a workstation with an Intel Core i7-4700MQ CPU @ 2.40GHz processor and a GeForce GTX 1080 GPU (NVIDIA Pascal). We use a coarser mesh of a rabbit heart with 3,073,529 tetrahedra and 547,680 vertices, and a finer mesh of the human heart, which contains 24,400,999 tetrahedra and 4,380,375 vertices. Results and analysis are shown for the domain decomposition (DD) method, which includes the Gray-code improvements. In Figure 33, we visualize the solution values and the arrival time ϕ(x) at each vertex x for the human heart mesh with multiple excitation points. Let us compare the numerical results between the first and the second DD approaches, and with the non-DD approach.

We focus especially on hardware limitations observed on the GTX 1080 GPU and how they could be overcome.

Table 16 shows a comparison between our new implementation, including the Gray-code method, and our old implementation without it. By using the local Gray-code numbering of edges to reduce the memory footprint, we achieved an acceleration of 35% for the OpenMP implementation and of 50% for the CUDA implementation. We observe a similar behavior on recent Intel CPUs; e.g., the 24 million tetrahedra example takes 5.6 fewer seconds with 256 threads on a KNL.
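The deliverable does not spell out the edge-numbering scheme itself, but the defining property any Gray code exploits — consecutive indices differ in exactly one bit, which allows compact incremental encodings — can be sketched as follows (the standard binary-reflected Gray code, not the project's specific numbering):

```cpp
#include <cassert>
#include <cstdint>

// Binary-reflected Gray code: gray(n) and gray(n+1) differ in exactly
// one bit, which is what makes Gray codes useful for compact encodings.
std::uint32_t to_gray(std::uint32_t n) { return n ^ (n >> 1); }

// Inverse mapping: undo the XOR cascade by folding high bits back down.
std::uint32_t from_gray(std::uint32_t g) {
    for (std::uint32_t shift = 1; shift < 32; shift <<= 1)
        g ^= g >> shift;
    return g;
}
```
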

Table 17 compares the run times between both DD versions for the coarser mesh with 3·10^6 tetrahedra. There are no global memory limitations for running the smaller example on the GTX 1080, with 8 GB global memory and 48 kB shared memory per block. All data fits into the global memory by preallocation.

This means that we do not have to reallocate for each iteration. It is significant mostly in


Figure 33: Arrival time ϕ(x) ranging from 0 (blue) to 1 (red) in the human heart mesh with multiple excitation points.

Table 16: Execution times in seconds on the workstation.

Implementation       # Tetrahedra   CUDA [s]   OpenMP, 8 threads [s]
Without Gray-code       3,073,529       1.49                    5.66
With Gray-code          3,073,529       0.73                    3.65
Without Gray-code      24,400,999      11.48                   56.63
With Gray-code         24,400,999       5.16                   36.43


the second DD approach, where the reallocation would happen dynamically using malloc within CUDA code, which is an expensive operation.

Table 17: Execution times in seconds on the GTX 1080 for the coarser mesh.

# Sub-domains   First DD Approach   Second DD Approach
           74                0.48                 0.69
          160                0.52                 0.60
          320                0.58                 0.51

We observe that the second DD approach scales better with the increased number of sub-domains, down to a convergence time of 0.51 seconds for 320 sub-domains. It is already faster than the version without DD, where the best time we achieved was 0.73 seconds for the coarser mesh, as shown in Table 16. The reason is that the DD versions use the block scan implemented by the CUB library instead of the device scan primitive as in [22].

Besides the block scan, we use block load and block store from the CUB library [23] for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block. Increasing the granularity ITEMS_PER_THREAD increases the efficiency. Performance also increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. The block scan computes a parallel prefix sum/scan of items partitioned across a block. If we decompose the domain into sub-domains small enough that the block scan can use the available shared memory and the register pressure does not affect occupancy, then this method fits our DD approaches. Additionally, the block scan can be called within a CUDA kernel, allowing it to be incorporated into one big kernel for the second DD approach. A device scan would require a separate kernel, and its drawback is the shared memory limitation.
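For reference, the result that the CUB block scan produces cooperatively within one thread block is that of an ordinary prefix sum; its sequential semantics, which any parallel scan must reproduce, are simply (our sketch, not CUB code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], out[0] = 0.
// cub::BlockScan computes the same result in parallel within one CUDA
// thread block using shared memory, with ITEMS_PER_THREAD consecutive
// items per thread handled via block load / block store.
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}
```
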

The block scan requires shared memory, and the shared memory size limits the number of sub-domains. A small number of sub-domains means larger sub-domains for the same mesh, and for this reason the data to be scanned for those sub-domains no longer fits into the shared memory. Hence, we start the testing with 2700 sub-domains for the larger mesh and 74 sub-domains for the coarser mesh. The idea is always to increase the number of sub-domains, since it improves the load balancing, and therefore this limitation is not relevant for our code.

Another reason why we are faster using the block scan is that the work is shared between blocks, processing each sub-domain independently, which implies that no synchronization between blocks is needed.

On the other hand, the device scan also shares the load among different blocks, since it has to compute over the whole domain; however, it needs block synchronizations in order to get correct results by joining and synchronizing the independent results of each block.

The shared memory usage, together with the consecutive items per thread used by the block scan, allows us to conclude that the block scan is the reason that the DD approaches are faster. We would like to emphasize that using the block scan is only possible because of the DD approach.

One can notice from the convergence results for the finer mesh in Table 18 that the second DD approach does not scale any more. This result is due to the limited global memory of a single GPU, such as the GTX 1080 in our case. While the global memory was large enough in the coarser example to allow the preallocation of the memory space needed for each sub-domain during the algorithm execution, it did not work out anymore for the larger example. As a consequence, the coarser example scales with the increased number of sub-domains as expected, while the GPU cannot provide enough memory to preallocate the needed space for each sub-domain in the larger example. In order to get satisfactory results we still preallocate the needed space for each sub-domain, but now we do it in each iteration, and we free that memory once a block terminates. This is done only for the active sub-domains, and no reallocation happens during the kernel execution. In this way, the global memory suffices, and we managed to run the DD approach within a single GPU. The DD version is now just one second slower than the version without DD. Comparing the results of Table 18 with the results shown in Table 16 for the CUDA implementation with Gray-code: with the increased number of sub-domains, the dynamic allocation increases, becoming the limiting factor of the scalability. Without this limitation, the algorithm would scale very well, as it does for the coarser example.

Table 18: Execution times in seconds on the GTX 1080 for the finer mesh.

# Sub-domains   First DD Approach   Second DD Approach
         2700                5.96                 6.89
         3000                6.40                 7.55
         4000                7.55                 9.46
         8000               14.00                14.74

In order to check the scalability of our DD approaches on different GPUs, we tested on an NVIDIA Titan X Pascal card with 24 multiprocessors (SMs), 4 more than a GTX 1080 card, and 4 GB more global memory, but still not enough to preallocate. The results we get for the mesh using 2700 sub-domains are approximately 20% faster than the results on a GTX 1080 for the first approach, and 13% faster for the second approach. For the first DD approach, we get a convergence time of 4.68 seconds, and for the second approach, 5.96 seconds. The second approach scales worse because of the scalability issue we observe in the DD approaches for the larger mesh, which affects the second approach more.

3.5.3 Conclusions

The Gray-code numbering has significantly reduced the overall memory footprint of the Eikonal solver, achieving performance improvements of 35% to 50%. The analysis showed that this Gray-code version decreased the level of non-coalesced accesses and significantly increased the computational density on the GPU.

The domain decomposition approach solves the Eikonal equation on large scale problems. We managed to run the domain decomposition approach on one GPU by using two different strategies in CUDA. The first strategy makes better use of shared memory, especially for coarser meshes, where we get a very good convergence time. However, it does not scale well with the increased number of sub-domains, since its implementation contains many kernels, resulting in many host synchronizations and memory transfers between the device and the host.

The second strategy assigns one block to one sub-domain, avoiding host synchronization and memory transfers nearly completely. This can be seen in the good scalability, thanks to the preallocation of global memory, for the coarser mesh. We still run into the global memory limitation for large-scale problems.

The domain decomposition approach is the first step towards the inter-process communication implementation, where the limitation of the global memory will be entirely overcome by


using multiple accelerator cards and cluster computing. The next section will cover the first version of the MPI-CUDA implementation, which allows the preallocation of global memory and enables scalability on large scale problems.

By testing on different GPUs with the Pascal architecture, such as the GTX 1080 and Titan X, we concluded that our CUDA implementations, the DD approaches and the non-DD approach, all scale very well on different NVIDIA GPUs.

3.5.4 MPI-CUDA Cluster Implementation

The main idea here is to implement the MPI communication on top of the second CUDA DD approach, using memory preallocation. Since in a GPU cluster there is always enough global memory to run the solver, there is no point in implementing MPI on top of the other version, where the memory is dynamically allocated in each iteration for each sub-domain and freed once the block execution terminates. Dynamic memory allocation was only performed to make it possible to run the larger mesh on one GPU; in a cluster, this is no longer necessary. So, to achieve the scalability we previously lacked on one GPU for the larger mesh, we use here the preallocated-memory version of the DD approach to exploit the available memory in a cluster and overcome the limiting factor of the scalability of the first approach. First, we distribute the domains to the GPUs or MPI processes: one GPU is managed by one MPI process and computes for many sub-domains. No load balancing strategy is applied at this moment, and all MPI optimizations are left as future work. A host memory allocation buffer is used in the same way as in the plain CUDA Eikonal solver to preallocate the memory. If more memory is needed, the buffer increases its size by freeing the current memory allocation and allocating more. The goal is to preallocate as much as needed to avoid extra CUDA memory allocations from the buffer, which would again affect the scalability of the DD approach. Each GPU contains a global array of pointers where all the resulting pointers of the memory allocations for the sub-domains pertaining to that GPU are stored. This allows all the blocks computing for a certain sub-domain within a GPU to access the memory allocated for the sub-domain they are processing. Each GPU allocates this memory for each sub-domain and stores the pointer in its global pointer array.
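The buffer handling described above can be sketched as a simple host-side arena (illustrative only: the real code manages CUDA device memory, and the class and member names below are our own):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One preallocated slab per GPU; sub-domain allocations are carved out of
// it and recorded in a per-sub-domain pointer table, mirroring the
// "global array of pointers" described in the text.
class PreallocArena {
public:
    explicit PreallocArena(std::size_t bytes) : slab_(bytes), used_(0) {}

    // Carve `bytes` out of the slab for sub-domain `sd`; grow first if needed.
    void* allocate(std::size_t sd, std::size_t bytes) {
        if (used_ + bytes > slab_.size())
            grow(used_ + bytes);               // free + larger reallocation
        void* p = slab_.data() + used_;
        used_ += bytes;
        if (sd >= table_.size()) table_.resize(sd + 1, nullptr);
        table_[sd] = p;                        // pointer table per sub-domain
        return p;
    }

    void* pointer_for(std::size_t sd) const { return table_[sd]; }
    std::size_t capacity() const { return slab_.size(); }

private:
    // As in the text: growing means freeing the current allocation and
    // allocating a larger one. NOTE: sketch only -- live data is dropped,
    // so growing is only safe between solver phases.
    void grow(std::size_t min_bytes) {
        std::vector<char> bigger(2 * min_bytes);
        slab_.swap(bigger);   // old slab freed when `bigger` goes out of scope
        used_ = 0;
        table_.assign(table_.size(), nullptr);
    }

    std::vector<char> slab_;
    std::size_t used_;
    std::vector<void*> table_;
};
```
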

When all the memory is preallocated and all the data has been transferred to the GPUs, the processing starts. Each GPU processes its sub-domains independently, and in each iteration the interface data is exchanged for synchronization between the GPUs. The difference here is the synchronization phase. There are two synchronization steps: the first synchronizes the sub-domains within one GPU and the second synchronizes between MPI processes and GPUs. The first synchronization phase is almost the same as in the second DD approach, where we synchronize between sub-domains in one GPU, with the difference that if one sub-domain of the interface pertains to another GPU or MPI process, the needed information is packed and broadcast using MPI; otherwise, it proceeds the same way. Before continuing to the next synchronization phase, called the syncNodes kernel, the packed information is sent and received by the GPUs. The packed information contains the destination sub-domain index, the vertex index at the destination sub-domain, and the solution value to be updated for that vertex. This information is broadcast to all MPI processes so that all the GPUs receive it. When the data is received, the GPUs call the syncNodes kernel. The information is global, and each GPU must identify its own part by the domain index. The domain index of each packet is therefore checked by each GPU, and if a GPU finds that the packet pertains to its sub-domain index list, that GPU processes the packet. The processing happens only if the solution value contained in the packet is smaller than the current solution of the vertex, i.e., a shorter travel time is available for the vertex. If one node of the domain is updated with


the new solution because a better value was found, then it needs to be added to the active list for further processing. The destination domain must be made active as well.

Each GPU calls the activeNodes histogram kernel on its globalActiveList as in the version without MPI. In order to get the total number of nodes in the global active lists of all GPUs, we call an MPI_Allreduce operation, which combines values from all the processes and distributes the result back to all processes. In our case, it combines the globalActiveList counts from all GPUs, computes their sum, and distributes it back so all processes can safely evaluate the termination condition. All processes stop when there are no active nodes on the lists of any process; this is the termination condition. It means that the solver has converged and we need to send the solution data from all the MPI processes to the master. Remember that the indexing in the sub-domains is based on the local-to-local mapping system, so the indexing is local, and after converting it to global we can finally work with the solution values. Finally, we check that the result is correct and then generate the vtk file for visualization.
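Stripped of the MPI machinery, the termination test is just a global sum of per-rank active-node counts; a sketch with the MPI_Allreduce stubbed out as a plain sum over per-rank values (the function names are ours):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Stand-in for MPI_Allreduce(..., MPI_SUM, ...): every "rank" receives
// the sum of all local contributions.
int allreduce_sum(const std::vector<int>& local_counts) {
    return std::accumulate(local_counts.begin(), local_counts.end(), 0);
}

// The solver has converged exactly when no rank has active nodes left.
bool converged(const std::vector<int>& active_nodes_per_rank) {
    return allreduce_sum(active_nodes_per_rank) == 0;
}
```

Because every process receives the same global sum, all ranks take the same branch and exit the iteration loop together, which is why a reduce followed by a broadcast (i.e., an all-reduce) is the natural collective here.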

3.5.5 Numerical Results and Conclusions

We present numerical results obtained on a BSC cluster, using cluster nodes containing 4 Tesla GPUs each. GPUDirect Peer-to-Peer (P2P) and GPUDirect remote direct memory access (RDMA) are supported, as explained later on. The mesh we are using contains more than 11 million tetrahedral elements and more than 2 million vertices, four times finer than the TBunnyC rabbit heart mesh.

Table 19: Execution times in seconds on the Tesla K80 for the finer mesh.

# Sub-domains   1 Tesla GPU             4 Tesla GPUs (P2P)   8 Tesla GPUs (RDMA)
                Dynamic mem alloc [s]      Mem prealloc [s]      Mem prealloc [s]
          256                    6.52                  5.32                  5.41
          512                    6.45                  4.19                  4.86
         1024                    6.69                  3.58                  3.77
         2048                    7.71                  2.90                  3.55

The table compares the results of the DD version with dynamic memory allocation during each iteration, which makes it possible to run the second DD approach on one GPU, against the DD version using preallocation in a cluster of GPUs. Since the finer mesh did not fit into the GPU global memory by preallocation, we implemented dynamic memory allocation at each iteration for each sub-domain, freeing that memory once the block execution terminates. This enabled the execution of larger examples on one GPU but disabled the scalability of the approach. The same is shown in Table 19, where we tested the finer mesh on one GPU, with the difference that here, instead of the GTX 1080, we use the Tesla K80 GPU. The convergence time increases with the number of sub-domains, as one can see in the second column. This is because the dynamic memory allocation grows with the number of sub-domains, becoming the limiting factor of the scalability. The third column of the table shows the DD version where the memory is preallocated. Preallocation is now possible, since we use the memory of four GPUs instead of the limited memory of one GPU. We run it first on 4 Tesla K80 and then on 8 Tesla K80 GPUs.

The first test case is very important, since we use 4 GPUs in one single node. We do that to reduce the effect of the non-optimized MPI implementation and avoid the network communication, which might affect the performance and limit the scalability of the DD approach


with respect to the number of sub-domains in a cluster. In our cluster, the BSC MinoTauro cluster, one node contains only 4 Tesla K80 with 11.17 GB of global memory, 2496 CUDA cores and 48 kB of shared memory per block. The cluster supports the GPUDirect technology, which provides high-bandwidth and low-latency communication with NVIDIA GPUs. To utilize such technologies, we use CUDA-aware MPI libraries. Since the first test case takes place within a cluster node, GPUDirect supports the intra-node inter-rank MPI communication. The variant of GPUDirect used is the Peer-to-Peer (P2P) transfer model. The cluster supports it, and the data can be copied directly from one GPU memory to another without going through the CPU at all. This accelerates the MPI communication significantly and helps us to clearly see, in the scalability of the DD approach, the effects of the increased global memory availability provided by the cluster. The third column illustrates this case. By comparing the results on one GPU using dynamic memory allocation (second column) with the results on 4 GPUs using the memory preallocation technique (third column), one can immediately notice the improvement in scalability that follows from the memory preallocation technique. Such improvement cannot be seen using one GPU, due to the limited memory which affects scalability. In a cluster, we have enough memory, and this proves that the DD approach using preallocation of the memory scales very well with respect to the increased number of sub-domains when there is enough global memory available.

What one does not see is good scalability with respect to the number of GPUs. There is already some scalability, best seen in the last row, using 2048 sub-domains. In this case, the scaling is almost a factor of 3 if one compares the second with the third column. This is acceptable considering that no MPI optimizations have been conducted. With the MPI optimizations discussed briefly in this section, whose implementation remains future work, the scalability with respect to the number of GPUs would improve significantly. Some evidence for this is contained in the last column of Table 19. Here we test using 8 Tesla GPUs, now using the GPUDirect RDMA technology. GPUDirect RDMA is a technology introduced in Kepler-class GPUs and CUDA 5.0 that enables a direct path for data exchange between the GPU and a third-party peer device, using standard features of PCI Express. In our case, the third-party peer device is the network interface. The data buffers are copied from the GPU memory to the network adapter, again without going through the CPU, so no host memory copy is issued or carried out by the CPU. However, going through the network translates into more latency on the MPI communication, even though it is GPUDirect-RDMA-accelerated. Due to the non-optimized MPI implementation, this overhead is more dominant in the scalability of the DD approach. This is the reason why in the last column of Table 19 we do not see the same scalability factor as in the case where we test within a cluster node. It shows that currently the MPI communication is a dominant factor, and its improvement could boost the scalability with respect to the number of GPUs, but also with respect to the number of sub-domains. In all cases, the cluster implementation of the Eikonal solver works well for solving large examples. Of course, further improvements are necessary, but they are left as future work.

3.5.6 OpenMP Performance and Scaling Results

In order to benchmark the Eikonal solver, the rabbit heart mesh with three million tetrahedra is used.

We tested our code on:

• Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz.

• Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz.


• Dibona ThunderX2 node, Armv8 64-cores.

Single node performance is reported for both the ThunderX2 and Skylake processors (see Table 20 and Table 21).

Table 20: Eikonal Solver on the Dibona cluster single node performance - TBunnyC i mesh

# Threads   Time [s]   SpeedUp   Efficiency
        1      38.52       1.00         1.00
        2      22.86       1.68         0.84
        4      11.87       3.24         0.81
        8       6.02       6.39         0.79
       16       3.19      12.07         0.75
       32       1.58      24.37         0.76
       54       1.11      34.70         0.64

Table 21: Eikonal Solver on Skylake single node performance - TBunnyC i mesh

# Threads   Time [s]   SpeedUp   Efficiency
        1      15.12       1.00         1.00
        2       9.32       1.62         0.81
        4       4.92       3.07         0.76
        8       2.27       5.55         0.69
       16       1.40      10.80         0.67
       32       0.76      19.89         0.62
       54       0.53      28.52         0.52

On the Intel Xeon Phi (KNL) node, the OpenMP implementation scales almost linearly up to 64 threads, as seen in Figure 34. Then the performance drops, due to the hyper-threading. We achieve similar scaling results for the larger mesh, for which the algorithm converges in 5.60 seconds using 256 threads. This is comparable with the CUDA implementation, where we achieved a convergence time of 5.16 seconds.

Figure 35 shows the efficiency of the Eikonal solver on two different Armv8 processors, with 24 and 32 cores respectively. Efficiency is slightly better on the 32-core processor, but in both cases we see a drop in efficiency when we jump from the first to the second socket.

For a comparison of the efficiency of the Eikonal solver on different platforms, see Figure 36. On the Dibona platform, efficiency is acceptable until the 32 cores of the first socket are used, after which it drops down to 50-60%.


Figure 34: Scaling results on Intel Xeon Phi 1.3 GHz (KNL) for the coarser mesh.

Figure 35: Eikonal solver efficiency on Dibona.


Figure 36: Efficiency comparison of Eikonal solver on different platforms.


3.6 RBF Interpolation Solver

The radial basis function (RBF) interpolation solver is a mini-app extracted from the AVL FIRE codebase to quickly evaluate new hardware or programming models. It represents a substantial subproblem of computational fluid dynamics on moving or deformable meshes. For a thorough description see [24].

The porting of this application to the Arm-based mini-clusters was initially described in [20, 25]. Two contributions were made to enable the application to use the mini-clusters efficiently. First, the pure MPI code was turned into a hybrid MPI+OpenMP code, to make use of the unusually high core counts of the mini-clusters. Second, an autotuning step was introduced, because the code contains tuning parameters which were previously kept constant; tuning them specifically for the mini-clusters rather than for a traditional HPC machine accounted for a 50% reduction in run time. A variable-depth domain decomposition scheme is used to generate tasks which can be computed independently in parallel; note, however, that splitting the domain into more segments requires more buffer cells and increases the amount of work to be done. This is illustrated in Figures 37 and 38.

Figure 37: Illustration of the first level of the multi-level domain decomposition of the input data.

Figure 38: Second level of the domain decomposition. The second color per domain marks the buffer cells for halo exchange.

3.6.1 Experimental Setup

For the evaluation on Dibona the following libraries were used:

• module load gcc/7.2.1

• module load openmpi4.0.0/gnu7

• module load arm/performance-libs/19.0.0

The mpic++ wrapper was used as compiler and linker command with the flags -march=native -mcpu=thunderx2t99 -O3 -ffast-math -ffp-contract=fast -larmpl -fopenmp.

Comparisons of node-level performance against the ThunderX Mont-Blanc mini-cluster and a Broadwell-series Intel(R) Xeon(R) E5-2690 v4 are made using data published in [20, 25].


3.6.2 Results

First, the mini-app was tested for correctness and its parameters were optimized for the Dibona platform. Interestingly, the break-even point between the brute-force solution of subproblems and the fast multipole method (FMM) has shifted to an even larger value than on the Intel Xeon system. The relative performance of all three systems is shown in Figure 39. This indicates a stronger relative performance in the dense linear algebra work, i.e., the BLAS used in the brute-force solution, versus the more branching- and pointer-chasing-heavy workloads used in the FMM variant. In this regard, Dibona has very similar characteristics to a typical HPC system.


Figure 39: Relative performance of a Fast Multipole Method (FMM) based kernel over a brute-force kernel for the Xeon and ThunderX test systems on a single core. The break-even point for Dibona is at 16000 points.

Figure 40 shows the run time of a small example and a large example on up to one node of each of the three tested systems. On the small example, a Dibona node is 2.5-4.0 times slower than the Xeon system and 2.3-6.2 times faster than the ThunderX system core-per-core. Comparing full nodes, Dibona is 5% faster than the Xeon system and 5.7 times faster than the ThunderX system. On the large example, a Dibona node is 2.5-3.5 times slower than the Xeon test system core-per-core, and 20% slower when comparing full nodes. The ThunderX mini-cluster was excluded due to impractically long run times.


[Figure 40 plots Time to Solution [s] versus number of cores (1 to 64) for the Small Problem (#Points = 262144; ThunderX, Dibona, Intel Xeon) and the Big Problem (#Points = 1048576; Dibona, Intel Xeon).]

Figure 40: Run time for a small and a large problem, scaling up to one node of each test system.


4 Scientific applications

4.1 Alya

Alya [26, 27], the HPC CFD code developed at BSC, was analyzed in detail in deliverable D6.1 [24]. The analysis suggested addressing two code performance issues: on the one hand, the load imbalance of some parts of the code; on the other hand, the approach used to implement reductions over large arrays with OpenMP.

Following these suggestions, two runtime techniques were applied to Alya: Multidependences and Dynamic Load Balancing (DLB). In deliverable D6.5 [19], these techniques were explained and an evaluation on an Intel-based cluster was presented.

In this section, we extend the evaluation of Multidependences and DLB to four different architectures, including the test platform (Dibona) and one of the mini-clusters (Thunder).

4.1.1 Runtime techniques summary

In this subsection, we briefly explain the runtime techniques used in our experiments; further details on the techniques can be found in deliverable D6.5 [19].

4.1.1.1 Multidependences The Multidependences approach relies on two new features added in OpenMP 5.0 [28]. First, the iterators feature is used to define the list of dependencies; specifically, it enables the user to define the number of task dependencies dynamically, at execution time. Second, the mutexinoutset dependence type defines a relationship between two tasks such that they cannot be executed at the same time, while their execution order is irrelevant. Using Multidependences, we can handle reductions on large arrays efficiently. Figure 41 covers the different parallelization approaches that have been evaluated.

[Figure 41 lists the pragmas used by each approach for the elements parallelization: Atomics - omp parallel do + omp atomic; Coloring - omp parallel do; Multidependences - omp task with mutexinoutset over an iterator.]

Figure 41: Parallelization approaches for reductions on large arrays

Figure 42: Left: unbalanced hybrid application. Right: hybrid application balanced with DLB

4.1.1.2 Dynamic Load Balancing (DLB) Dynamic Load Balancing (DLB) [29, 30, 31] is a library that aims to improve the load balance of hybrid applications. In an application leveraging multi-level parallelism, e.g., MPI+OpenMP, DLB uses the second level of parallelism


(usually OpenMP) to improve the load balance at the MPI level and achieve better overall performance.

Figure 42 shows the behavior of DLB when load balancing a hybrid application. On the left side we can see an unbalanced MPI+OpenMP application with 2 MPI processes and 2 OpenMP threads per process. On the right side we can see the same execution load balanced with DLB. We can observe that when MPI process 1 reaches a blocking MPI call, it lends its resources to MPI process 2. At this point MPI process 2 is able to use 4 OpenMP threads and finish its computation faster. When the blocking MPI call completes, each MPI process recovers its original resources.

4.1.2 Experimental setup

4.1.2.1 Use case: The human respiratory system and particle transport For this evaluation we simulate the transport of particles injected in an unsteady flow through the human large airways during a rapid inhalation. This kind of simulation needs to solve two different physics problems: the velocity of the fluid (the air going through the human airways) and the particle transport.

[Figure 43 sketches the two execution modes: synchronous, where all n MPI processes alternate between the fluid and particle steps, and coupled, where f processes solve the fluid and p processes solve the particles, with n = f + p, and the fluid processes send the velocity field to the particle processes.]

Figure 43: Execution modes for CFPD simulations with Alya

In Figure 43 we can see the different options to run this kind of simulation. At the top we can see the synchronous execution, where all the processes first solve the velocity of the fluid and then the particle transport. At the bottom, the coupled execution is represented; in this case some MPI processes solve the velocity of the fluid and send it to the MPI processes that solve the particle transport.

When using the coupled simulation, the user can decide how many processes are assigned to solve the fluid and how many solve the particle transport. Depending on this and on the number of particles injected, the more heavily used processes will be either the ones solving the fluid or the ones solving the particles. This is depicted at the bottom of Figure 43, where f > f′ and p < p′. The conclusion is that, depending on the decision taken by the user, the performance of the simulation can vary. The optimum distribution of MPI processes depends on the simulation parameters, the platform and the number of hardware resources used.

4.1.2.2 Platforms and environment In this section we present results obtained on four different platforms:

MareNostrum4 is a supercomputer based on Intel Xeon Platinum processors, Lenovo SD530 Compute Racks, a Linux operating system and an Intel Omni-Path interconnect. Its


general-purpose partition has a peak performance of 11.15 Petaflops and 384.75 TB of main memory spread over 3456 nodes. Each node houses 2× Intel Xeon Platinum 8160 with 24 cores at 2.1 GHz; 216 nodes feature 12× 32 GB DDR4-2667 DIMMs (8 GB/core), while 3240 nodes are equipped with 12× 8 GB DDR4-2667 DIMMs (2 GB/core).

Thunder is one of the mini-clusters of WP4. It is composed of four computational nodes integrated in a 2U server box. Each node features a dual-socket motherboard housing 2× Cavium ThunderX CN8890 Pass1 SoCs with 48 custom Armv8 cores at 1.8 GHz in each socket, 128 GB DDR3 at 2.1 GHz per node and a 256 GB SSD for local storage. Nodes are air-cooled and interconnected using a single 40 GbE link.

Power9 is a cluster based on IBM Power9 processors, with a Linux operating system and an Infiniband interconnection network. Each node has two 8335-GTG processors, 512 GB of main memory and 4 NVIDIA V100 (Volta) GPUs. The cluster has 52 nodes.

Dibona is the project’s test platform, described in detail in Section 1.

For this evaluation we use the OmpSs implementation of Multidependences, because these features are not yet available in any OpenMP implementation, due to their recent inclusion in the standard (October 2018).

In Table 22 we present the compilation and execution environment used on each of the platforms considered in this section.

Table 22: Environment used in the different platforms

Platform       Compiler    MPI version       OmpSs     DLB
MareNostrum4   GCC 8.1.0   OpenMPI 3.0.0     17.12.1   2.0.2
Thunder        GCC 5.3.0   OpenMPI 3.0.1     17.12.1   2.0.2
Power9         GCC 8.1.0   OpenMPI 3.0.0     17.12.1   2.0.2
Dibona         GCC 7.2.1   OpenMPI 2.0.2.14  17.12.1   2.0.2

All the experiments comparing the different platforms have been executed using 2 nodes of each machine. All the results are obtained by averaging over 10 time steps.

4.1.3 Multidependences evaluation

In this section we evaluate the performance of Multidependences compared with the implementations using a coloring algorithm or atomic pragmas to avoid the race condition when reducing over a large array. We evaluate the impact on performance in two phases of the simulation: the matrix assembly and the subgrid scale (SGS).

The benefit of using Multidependences in the matrix assembly is to avoid the use of atomics and preserve spatial locality. In the case of the SGS, no update of a shared structure is involved; therefore, there is no need for atomic pragmas. Nevertheless, we evaluate the performance of this phase in order to see the overhead added by Multidependences.

We have executed three different versions of each simulation: using atomic pragmas, labeled Atomics; a coloring algorithm, labeled Coloring; and the Multidependences implementation, labeled Multidep. For each version we have executed different combinations of MPI processes and threads, with 1, 2 or 4 threads per MPI process. In the charts, each combination is shown as: total number of MPI processes × number of OmpSs threads per MPI process.


In this section, we show the speed-up obtained by each hybrid execution with respect to the pure MPI version using the same number of nodes in each cluster (i.e., running with 96 MPI processes on MareNostrum4 and 192 MPI processes on Thunder). Within the same cluster, we compute the speed-up S as S_c = t_M / t_c, where t_c is the time spent simulating a given problem with configuration c of MPI processes and OmpSs threads, and t_M is the time spent simulating the same problem with the pure MPI implementation.

Table 23: Elapsed time in each phase of Alya for the pure MPI version on 2 nodes of each cluster

Platform       Assembly [s]   Subgrid Scale [s]
MareNostrum4    6.75           4.16
Thunder        19.76          12.55
Power9          9.30           6.45
Dibona          7.48           4.72

Table 23 contains the elapsed time of the assembly and subgrid scale phases of Alya when running the pure MPI version on 2 nodes of each cluster. These are the reference values used for each cluster to compute the speed-up of the different hybrid versions.

Figure 44: Speed-up w.r.t. pure MPI for different parallelizations of the Alya assembly phase

In Figure 44 we can see the speed-up of each parallelization with respect to the pure MPI version of the matrix assembly. The y-axis shows the speed-up and the x-axis the different configurations of MPI processes and OmpSs threads.

In this chart we can observe that using atomic pragmas has a negative impact on all platforms, but also that this negative impact is much lower on the Arm-based clusters (Dibona and Thunder) than on MareNostrum4 and Power9.

The coloring version also has a negative impact in almost all configurations and platforms, with the worst impact on MareNostrum4 and Dibona.

The Multidependences implementation is the best option in all cases. On Dibona and Power9, it obtains performance better than or equal to the pure MPI version for all configurations. On the other hand, on MareNostrum4 and Thunder only one of the configurations using Multidependences performs better than the pure MPI version on the same cluster.

Figure 45 shows the speed-up of the different OmpSs parallelizations in the subgrid scale with respect to the pure MPI version. The x-axis shows the number of MPI processes and OmpSs threads used, and the y-axis the speed-up.


Figure 45: Speed-up w.r.t. pure MPI for different parallelizations of the Alya subgrid scale phase.

For the subgrid scale the best option is the version with atomic pragmas, because in this phase there is no update of a large array and the atomic pragmas are not necessary. Nevertheless, it is still interesting to see what the impact of the coloring version, which has worse data locality, is on each platform. Data locality seems to have a higher impact on MareNostrum4 and Power9 than on Thunder and Dibona. Also, the overhead introduced by the use of Multidependences is more relevant on Dibona and Thunder than on the other clusters.

4.1.4 DLB evaluation

To evaluate the impact of using DLB on the performance of coupled codes, we run two kinds of simulations: in one of them 4·10^5 particles are injected into the respiratory system, and in the other one 7·10^6 particles are injected.

With these two simulations we represent two different scenarios: one where the main computational load is in the fluid code, injecting only 4·10^5 particles; and another one where the main computational load is in the particle code, injecting 7·10^6 particles.

All the experiments in this section were executed using the Multidependences version of the code for the matrix assembly and the atomics version for the subgrid scale, since these were the versions that obtained the best performance in each phase in the previous section. Also, all tests were performed using one OmpSs thread per MPI process on two nodes of each cluster. The processes were launched in an interleaved scheme to distribute the processes solving the particles p and the processes solving the fluid f among the two nodes.

As explained in Figure 43, this simulation can be executed in synchronous or coupled mode. When running in coupled mode, the number of processes assigned to the computation of the fluid f and the number of processes assigned to the computation of the particles p must be decided by the user. We present experiments using both modes and varying f and p when running coupled simulations.

In Figures 46 and 47 we can see the average execution time in seconds of one time step on MareNostrum4. On the x-axis, the different modes and combinations of MPI processes are represented in the form f + p. We observe that, depending on the mode and combination of MPI processes, the execution time can change by up to 2× compared to the original code. The use of DLB improves all the executions of the original code. The improvement obtained by using DLB with respect to the same type of execution of the original code depends on the mode and combination of MPI processes.

Figure 46: Simulation of 4·10^5 particles on MareNostrum4

Figure 47: Simulation of 7·10^6 particles on MareNostrum4

Figure 48: Simulation of 4·10^5 particles on Thunder

Figure 49: Simulation of 7·10^6 particles on Thunder

Figures 48 and 49 show the average execution time of one time step when simulating 4·10^5 particles (left) and 7·10^6 particles (right) on Thunder. On the Arm-based cluster the performance trend of this simulation is similar to the Intel-based one. If the user makes a wrong decision (e.g., running the coupled execution with 96 MPI processes for the fluid and 96 MPI processes for the particles), the simulation can be 2× slower than the best configuration (e.g., the synchronous execution). Also on Thunder, the use of DLB improves the performance of all configurations and minimizes the effect of choosing a bad combination of MPI processes.

Figure 50: Simulation of 4·10^5 particles on Dibona

Figure 51: Simulation of 7·10^6 particles on Dibona

The average execution time of one time step of the respiratory simulation running on Dibona can be seen in Figures 50 and 51. We can see that DLB improves all the cases, both with 4·10^5 particles and with 7·10^6 particles, independently of the distribution of MPI processes between fluid and particles. We also observe that, as on the other platforms, the simulation execution time can vary by 2× in either direction, depending on the user's decisions regarding the mode (synchronous or coupled) and the number of MPI processes assigned to each code (fluid and particles).

Figure 52: Simulation of 4·10^5 particles on Power9

Figure 53: Simulation of 7·10^6 particles on Power9

In Figures 52 and 53 we can see the average execution time of one time step when running the simulation on two nodes of the Power9 cluster. The observations are similar to the other platforms: the distribution of MPI processes between the fluid and the particles is crucial, and DLB improves all the executions and alleviates the negative impact of a bad distribution.

To conclude, on all platforms the performance depends on the mode (synchronous or coupled) and the number of MPI processes assigned to each code (fluid and particles). As the number of cores per node differs between platforms, the user must reconsider this decision for each cluster. On the other hand, DLB reduces the performance difference between the different choices.

4.1.5 Scalability study

The Alya scalability experiments on Dibona have been done using the respiratory system use case, injecting 4·10^5 particles into the human airways. We have evaluated two versions of the code. On one hand, the original MPI-only code, which is the one used in production runs by Alya users; we label this run Base. On the other hand, the best version from the previous sections: Multidependences for the matrix assembly, atomic pragmas for the subgrid scale, the synchronous version for the coupling of physics, and DLB to improve the load balance at all levels, labeled DLB.

The runs have been done launching 64 MPI processes per node; in the case of the hybrid version this implies using one OmpSs thread per MPI process.

Figure 54: Execution time of the respiratory system up to 2048 cores

Figure 55: Efficiency of the respiratory system up to 2048 cores


In Figure 54 we can see the execution time of 10 time steps when using from 1 to 32 nodes of Dibona (64 to 2048 cores). In Figure 55 we can see the same numbers represented as efficiency. We observe that the scalability of the simulation drops beyond 4 nodes (256 processes) with the MPI-only version. The efficiency of the MPI-only version with 4 nodes is 0.75.

The executions using Multidependences and DLB scale better, obtaining an efficiency of 0.80 with 8 nodes (512 processes). On 32 nodes (2048 cores), the DLB version is 1.73× faster than the MPI-only version.

4.1.6 Energy measurements

In this subsection we use the power monitoring infrastructure explained in Section 1.6 to measure the energy to solution of the respiratory simulation when injecting 4·10^5 particles into the human airways. We executed two versions of Alya: the base, pure MPI code, and the hybrid code using the OmpSs Multidependences implementation and the DLB load balancing mechanism.

Figure 56: Energy to solution of the respiratory simulation in Dibona

In Figure 56 we see the energy to solution in Joules for the two versions when scaling the simulation from 1 to 32 nodes of Dibona. In Table 24 we also analyze the scalability of the energy spent to perform one step of the computation. In the two rightmost columns, we compute the energy-delay product, a metric often used to assess the energy proportionality of solutions aimed at improving the overall efficiency of a computational system. From this data, we see that the DLB version of Alya shows a lower energy-delay product, decreasing more rapidly with increasing node count than the base version of the code.

4.2 OpenIFS

4.2.1 Introduction

OpenIFS [32] is both an ECMWF project and a model. The OpenIFS project at ECMWF encourages research and teaching in numerical weather prediction, from medium-range to seasonal timescales. The OpenIFS programme at ECMWF provides academic and research institutions with an easy-to-use version of the ECMWF IFS (Integrated Forecasting System): the OpenIFS model, the single column model (SCM) and the offline surface model (OSM). The OpenIFS model provides the full forecast capability of IFS, with supporting software and documentation, but without the data assimilation system.


Table 24: Energy efficiency of the respiratory simulation

         Elapsed/step [s]   Energy/step [J]       Avg Pwr/node [W]   EDP [J·s]
cores    Base     DLB       Base      DLB         Base     DLB       Base       DLB
64       38.60    27.76      9017.04   6873.51    233.60   247.62    348055.62  190795.16
128      20.01    13.71      9609.10   7656.06    240.12   279.16    192270.21  104984.43
256      12.84     7.39     15540.19   7880.31    302.49   266.77    199592.20   58196.32
512       8.09     4.34     20487.02   9605.20    316.49   276.77    165770.92   41668.11
1024      4.26     2.75     23086.46  12334.08    338.48   280.72     98416.54   33869.80
2048      2.79     1.61     27164.38  18450.77    304.79   358.17     75657.75   29702.20

4.2.2 The OpenIFS programme

The ECMWF OpenIFS programme provides an easy-to-use, exportable version of the IFS in use at ECMWF for operational weather forecasting. The programme aims to develop and promote research, teaching, and training on numerical weather prediction (NWP) and NWP-related topics with academic and research institutions.

Use of OpenIFS on topics of interest to member states is encouraged. Research topics at ECMWF are described in more detail on the main ECMWF site¹⁰. Enquiries regarding potential collaborations are welcome.

OpenIFS provides the forecast-only capability of IFS (no data assimilation), stays close to the operational version and supports operational configurations.

The OpenIFS model 'package' includes not just the model itself but also acceptability tests for compilers/hardware, example case studies, plotting and analysis tools and so on. The model is intended for research and educational use by universities, research organisations and individual researchers on their own computer systems.

4.2.3 About IFS at ECMWF

IFS is a collaborative effort that started between ECMWF and Météo-France, and today involves many ECMWF member states and associated consortia of national meteorological services.

4.2.4 OpenIFS requirements

The OpenIFS model requires the following packages in order to build and run:

bash - the scripts accompanying the package make use of the Bash shell.

perl - is required for the FCM software used to compile the models.

FORTRAN and C compilers - the FORTRAN compiler must support an auto-double (e.g., -r8) capability, as some source code files are in fixed-format FORTRAN. The OpenIFS model can make use of OpenMP, but this is optional. Note that if OpenMP is used, then the version of MPI used should be thread-safe.

LAPACK & BLAS libraries.

grib-api/eccodes - the grib-api library or its replacement eccodes, and the accompanying set of commands for working with and manipulating GRIB files. GRIB is the file format used as input and output by OpenIFS.

¹⁰ https://www.ecmwf.int/


Figure 57: Execution time of OpenIFS (T255, FCLEN=d01)

MPI library - it can either be vendor supplied or one of the freely available versions such as MPICH or OpenMPI.

python - is required to run some of the tools. Python 2.7 is recommended.

netCDF - an implementation of netCDF must be available on the system, as the models read/write the netCDF format.

4.2.5 Evaluation

4.2.5.1 Node to node comparison between Dibona and MareNostrum4

This section provides a node-to-node performance comparison of Dibona [33] and MareNostrum4 [34] when executing OpenIFS. We tested different compilers on both machines. Figure 57 presents the strong scaling test performed on one node of each machine for five different compiler configurations. The results focus on the total execution time as measured by the application when simulating one day (i.e., FCLEN=d01 inside the OpenIFS running script).

Each point in Figure 57 corresponds to an average of 5 different runs (variability is always less than 5%). We have tested three different configurations on Dibona:

1. Dibona-Arm corresponds to OpenMPI 3.1.2 with Arm HPC 19.0 as backend compiler.

2. Dibona-GCC8 corresponds to OpenMPI 3.1.2 with GCC 8.2.0 as backend compiler.

3. Dibona-GCC7 corresponds to OpenMPI 3.1.2 with GCC 7.2.1 as backend compiler.

For the MareNostrum4 case, the following two configurations are represented in Figure 57:

1. MN4-Intel corresponds to IMPI 2018.3 with Intel 18.0.3 as backend compiler.

2. MN4-GCC7 corresponds to OpenMPI 3.0.0 with GCC 7.2.0 as backend compiler. Problems were found when running this configuration with more than 16 cores, as reflected in Figure 57.


Figure 58: Efficiency of OpenIFS (T255, FCLEN=d01)

In the case of Dibona, both GCC combinations perform almost the same. The use of the Arm compiler introduces an improvement that makes the performance on Dibona equivalent to that obtained on MareNostrum4 with the GCC compiler. The Intel compiler, in turn, provides a large improvement with respect to the GCC configuration when running on MareNostrum4. Comparing Arm and Intel (both machines and compilers), note how the execution time obtained with 64 cores on Dibona with the Arm compiler is comparable with that obtained with 32 cores with Intel on MareNostrum4.

Figure 58 presents the same results from an efficiency point of view. On Dibona, the Arm compiler shows a worse efficiency than GCC, but this is mostly due to its better performance when running with only 1 MPI process (the baseline). Note also that the efficiency obtained by the Arm compiler on Dibona is equivalent to that of the Intel compiler on MareNostrum4.

4.2.5.2 Multi node strong scalability on Dibona

The scalability test presented in the previous section has been extended beyond one node in the case of Dibona. Figures 59 and 60 show the results obtained using the same configurations. Unfortunately, at the time of writing this document, it has not been possible to complete the plot with the points at 2048 cores for the 3 studied compilers, due to the problems reported in Section 1.5.3 (i.e., jobs always freeze when trying to run with 32 nodes). Even obtaining results for 16 nodes with the Arm and GCC7 compilers has not been possible so far. Note, however, that the trend in all cases seems quite clear.


Figure 59: Execution time of OpenIFS (T255, FCLEN=d01)

Figure 60: Efficiency of OpenIFS (T255, FCLEN=d01)


4.3 TensorFlow

TensorFlow is an open-source library for Machine Learning (ML) applications, e.g., image classification, recommendation of applications, and speech recognition [35]. It is designed and implemented by the Google Brain Team, written in C++ and Python, and some of its parts use CUDA for acceleration on GPUs. It consists of ∼650,000 lines of code. Both the scientific community and industry (e.g., Intel and NVIDIA) contribute to its development¹¹. It is employed as a benchmark of new ML architectures [36] and as an ML engine when coupled with well-trained models [37].

In this section we i) evaluate the standard open-source release of TensorFlow v1.10 on Dibona, ii) report the performance of TensorFlow v1.10 after plugging the Arm Performance Libraries (ArmPL) v19.0 into the original source code, and iii) compare the performance of TensorFlow on three different state-of-the-art HPC platforms: Intel x86 Skylake (MareNostrum4), IBM Power9 (CTE-Power) and Cavium ThunderX2 (Dibona). In all experiments of this section we compare the CPU-only performance of the training phase and we use OpenMPI v2.0.2.14.

4.3.1 Basic concepts

Since the TensorFlow framework is relatively new and does not focus specifically on HPC, we dedicate this section to introducing the basic concepts needed to understand the framework and the evaluation that we performed on Dibona.

Model – Machine learning models are sets of rules that predict some output for a particular given input under certain circumstances. These circumstances are parameters that summarise the relationships between input data [38, 39]. This section focuses on the AlexNet and ResNet-50 models.

Workload – During the training phase of a model, an epoch refers to a single iteration over the complete dataset composed of E data elements (images in our case). A batch is a subset of the epoch, processing a dataset smaller than (or equal to) the complete one, composed of B data elements. A step refers to the work of processing one batch of the dataset and accordingly updating the model parameters. The number of steps S in an epoch is S = E/B.

Figure of merit – Our evaluation focuses on the training phase of the neural network. We use models for image recognition. The figure of merit for our evaluation is images per second (img/s). To keep the comparison simple, we just check that the accuracy of the model and the convergence time of the training process are preserved among different tests.

TensorFlow components – A tensor is an n-dimensional array, with n ∈ N, while a graph describes the computation among tensors as a directed graph, aka dataflow computation [40]. TensorFlow operations are the nodes in a TensorFlow graph. Each of these operations takes one or more tensors as input and returns zero or more tensors as output. A TensorFlow kernel is the implementation of a given operation. Kernels can differ depending on the type of data they treat (e.g., int, float, double) or the type of device on which the operation will be performed (e.g., there are specialized kernels for GPUs). Concerning parallelism, TensorFlow defines intra-ops as the number of threads used by a kernel and inter-ops as the level of parallelism expressed at the node level of a graph, in other words how many different kernels can be executed at the same time. Horovod [41] is a distributed training framework for TensorFlow and other machine learning frameworks.

¹¹ https://github.com/tensorflow/tensorflow/tree/master/tensorflow


D6.9 - Performance analysis of applications on Dibona, Version 1.0

4.3.2 Profiling

In this section, we present the profiling of a TensorFlow training session. Our base code for TensorFlow is version 1.10.0¹² and we use the benchmark tool from the TensorFlow repository¹³. The model is AlexNet and the input set is based on synthetic data, so we do not read real datasets (e.g., ImageNet), to avoid intensive IO operations. The batch size of our tests is 1024 images. To profile a simple case, we configure the number of inter-ops to 1, so we process one operation of the graph at a time, and the number of intra-ops to 64, to take advantage of all the computational resources within a Dibona node (64 threads).
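The configuration just described maps to command-line flags of the tf_cnn_benchmarks script; a sketch of what such an invocation could look like (flag names taken from the public benchmarks repository; the exact command used for the runs in this report is not recorded here, so treat this as indicative):

```shell
# AlexNet training with a batch of 1024 images; synthetic input data is
# used when no --data_dir is given. One graph operation at a time
# (inter-op = 1) using all 64 hardware threads of a Dibona node
# (intra-op = 64).
python tf_cnn_benchmarks.py \
    --model=alexnet \
    --batch_size=1024 \
    --device=cpu \
    --data_format=NHWC \
    --num_inter_threads=1 \
    --num_intra_threads=64
```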

To profile a training step with AlexNet we use a TensorFlow built-in trace generator. Traces generated by this tool can be visualized with Chromium or Chrome. In Figure 61, we show one trace of the Vanilla version of TensorFlow. The trace visualization tool gives us three main pieces of information: the type of operation executed (red box), tensor lifespan (blue), and memory usage (green).

Figure 61: TensorFlow visualizer of an execution using inter-ops=1 and intra-ops=64

The trace visualizer can also provide the aggregated execution time of every operation. With this information, we can identify the most time-consuming functions. In Figure 62, we can see that ∼70% of the total execution time is spent in the Conv2DBackpropInput function, while ∼18% of the time is spent in Conv2DBackpropFilter.

Of course, speeding up these two operations would improve the overall performance. We profiled the inner calls with perf and matched the most time-consuming functions and operations. In Table 25, we show that most of the execution time is spent in routines from a library called Eigen.

¹² https://github.com/tensorflow/tensorflow/releases
¹³ https://github.com/tensorflow/benchmarks


Figure 62: Time distribution among operations for the AlexNet model

Table 25: Perf report output of TensorFlow/1.10, selecting only functions with overhead > 1%

Overhead   Shared object                   Symbol

76.58%     pywrap_tensorflow_internal.so   Eigen::internal::gebp_kernel<float, float, long,...
 4.44%     pywrap_tensorflow_internal.so   Eigen::internal::gemm_pack_rhs<float, long,...
 4.06%     pywrap_tensorflow_internal.so   Eigen::internal::gemm_pack_rhs<float, long,...
 2.02%     pywrap_tensorflow_internal.so   std::_Function_handler<void (long, long),...
 1.50%     pywrap_tensorflow_internal.so   Eigen::internal::gemm_pack_lhs<float, long,...
 1.42%     pywrap_tensorflow_internal.so   (anonymous namespace)::Col2im<float>
 1.32%     libc-2.17.so                    memcpy

Eigen¹⁴ is a C++ library for linear algebra. In the generic kernels (i.e., the generic implementations of an operation), Eigen is used for tensor operations. The Eigen::internal::gebp_kernel function, Eigen's general block-times-panel multiply subroutine for matrix multiplication, accounts for ∼76% of the time. Once the significance of the Eigen calls to the total execution time was identified, we analyzed Conv2DBackpropInput and Conv2DBackpropFilter to isolate the Eigen calls.

4.3.3 Arm Performance Libraries evaluation

Since the Eigen library is a collection of linear algebra functions, we studied how to replace some of the Eigen calls with the corresponding optimized implementations found in the Arm Performance Libraries (ArmPL). Given their dominance in the execution time, we first integrated calls to ArmPL into the Conv2DBackpropInput and Conv2DBackpropFilter operations.

In Figure 63, we show the output of the TensorFlow trace visualizer before (Vanilla) and after (TF-ArmPL) implementing the two operations with ArmPL. The total execution time of a step is reduced by ∼2.5×. In Figure 64, we show the breakdown of the execution time of a step into operations, aggregating the time of each operation. We see that the most noticeable speed-up, ∼5.6×, happens in Conv2DBackpropInput, while in Conv2DBackpropFilter we observe a ∼13% reduction in execution time.

¹⁴ http://eigen.tuxfamily.org/


Figure 63: Timelines of the same step using the Vanilla TensorFlow (top) and TF-ArmPL (bottom)

Figure 64: Breakdown of the execution time of a step

4.3.4 Single-node evaluation

In this section, we report the measurements performed on a Dibona node, with and without ArmPL. To this end, we benchmarked the training phase using two models, AlexNet and ResNet-50. We also compare Dibona with MareNostrum4 and CTE-Power. We use a hybrid configuration (MPI + shared memory) within the node, using img/s as the performance metric. In order to use distributed memory, we used Horovod and a variable number of intra-threads. In the case of 1 × 64 (MPI processes × threads), we do not use Horovod.

Figure 65: Single-node evaluation (MPI processes × threads) for the AlexNet model


Figure 66: Single-node evaluation (MPI processes × threads) for the ResNet-50 model

In Figures 65 and 66, the Y-axis shows the training performance measured in img/s and the X-axis shows different configurations of MPI processes × threads. For the AlexNet model in Figure 65, the TF-ArmPL version delivers ∼2.8× the performance of Vanilla in the purely shared-memory configuration without Turbo mode. In the other configurations, the improvements range from 1.44× to 1.89×. For the ResNet-50 model shown in Figure 66, the improvement of TF-ArmPL over Vanilla ranges from 1.37× to 1.51×. In both cases, activating Turbo mode leads to an average performance improvement of 20%.

We then compare a node of Dibona with a node of MareNostrum4 and Power9. On the Power9 platform, we do not use GPU acceleration. On MareNostrum4 we evaluate two versions, with and without the Intel Math Kernel Library (MN4-MKL)¹⁵ ¹⁶. Figure 67 reports the results of the comparison.

Figure 67: Comparison between different platforms. The configurations on Dibona are the best for each model as analyzed in Figures 65 and 66. The configuration on MareNostrum4 is one process and 48 threads, while on Power9 we tested one MPI process per physical core.

In Figure 67, the MN4-MKL version shows the best sustained performance for the two evaluated models. We also see a significant performance difference on MareNostrum4 (7× with AlexNet and 6× with ResNet-50) between using MKL and not using it. The

¹⁵ https://software.intel.com/en-us/mkl
¹⁶ https://github.com/intel/mkl-dnn


second best performance for the two models is on the Dibona platform, ∼3.75× slower than MN4-MKL for AlexNet and ∼2.72× slower for ResNet-50. Dibona is ∼1.64× faster than Power9 for AlexNet and ∼1.23× faster for ResNet-50. The gap between the MareNostrum4 versions, with and without MKL, shows how important it is to have a well-optimized linear algebra implementation. This can also be seen in the comparison of Dibona with Power9: the Vanilla version of TensorFlow on Dibona, seen in Figures 65 and 66, is ∼1.07× slower for AlexNet and ∼1.37× slower for ResNet-50. This shows that a good linear algebra library, like ArmPL, can make a platform outperform others.


4.4 LBC

Contrary to earlier plans, it has not been possible so far to run the application Tangaroa on Dibona. Incompatibilities between compiler versions, the OmpSs runtime system, and the source code did not allow us to build the application on Dibona. As a contingency, we used the Lattice Boltzmann code LBC to test the platform and the OmpSs programming model.

LBC is a Lattice Boltzmann code written in Fortran. It was used previously in the Mont-Blanc 2 project, and thus already implements a second-level domain decomposition on top of the first-level MPI domain decomposition, which is suitable for tasking. As such, only a few code changes were needed.

In particular, we used the pure-MPI version of LBC to compare node-to-node performance between Dibona and the Cray XC40 Hazel Hen installed at HLRS. Further, the pure-MPI version has been compared to several OmpSs versions which expose increasing potential to overlap computation and communication. One of the versions exploits the communication-thread approach developed as part of task T7.5 (see D7.13). Interestingly, the communication-thread approach works very well on Hazel Hen, but is beneficial on Dibona only if a CPU core is dedicated to the communication thread and thus not available for computation. Again, this behaviour has been analysed and reported in D7.13. On Dibona, the communication-thread approach is susceptible to thread starvation at the operating system's thread scheduler level.

4.4.1 Experimental setup

The following details regarding the experimental setup are common to the experiments and performance data reported below. The experiments have been done on Dibona and on Hazel Hen.

Dibona's hardware and software stack is described in Section 1. We used the GCC compiler v7.2.1 and Open MPI v2.0.2.14. For the OmpSs experiments, we used a version based on the release 10.2017, modified to implement the communication thread.

Hazel Hen is a Cray XC40 system with nodes consisting of two sockets, each with an Intel Xeon E5-2680 v3 (Haswell) CPU with 12 cores. The interconnect is a Cray Aries with dragonfly topology. We have used the GNU programming environment version PrgEnv-gnu/6.0.4 with GCC v7.2.0 (rather than the default v7.3.0) and Cray MPICH v7.7.3.

Each data point in this section's figures corresponds to the arithmetic mean over a sample of at least 20 time measurements. Error bars correspond to the standard deviation of the sample and are shown for all data points in the subsequent figures. In most cases, they are not easily visible, as they are small compared to the scale of the plotted data.

In the Lattice Boltzmann community, the underlying grid cells are often referred to as lattice elements. Also, the usual metric for performance is the number of lattice updates per time interval, measured in units of MLUPS (mega lattice updates per second, i.e., 10⁶ lattice updates per second). This metric is reported by the application at the end of the run. Note that LBC disregards the initialisation phase and other overhead when reporting performance.
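A minimal sketch of how this metric can be computed (illustrative only; LBC reports the value itself at the end of a run):

```python
def mlups(nx, ny, nz, timesteps, elapsed_seconds):
    """Mega lattice updates per second.

    Every lattice element is updated once per timestep, so the total
    number of lattice updates is nx * ny * nz * timesteps.
    """
    updates = nx * ny * nz * timesteps
    return updates / elapsed_seconds / 1e6

# Hypothetical example: a 512^3 domain run for 10 timesteps in 5 seconds.
print(mlups(512, 512, 512, 10, 5.0))  # ~268.4 MLUPS
```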

For comparison purposes, the problem size of all experiments is chosen such that the number of lattice elements per core is nc = (256 × 256 × 32), so only weak scaling has been considered. We have not plotted scaling curves increasing the number of cores per node; we always used full nodes instead. The problem sizes per node are represented as nN and the 3-dimensional domain decomposition, with MPI ranks arranged as pN; both are shown in Table 26.
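The per-node problem sizes in Table 26 follow directly from the per-core size and the rank layout; a small sanity-check sketch (values taken from Table 26):

```python
NC = (256, 256, 32)  # lattice elements per core, nc

def node_problem_size(pn):
    """Per-node domain nN given the MPI rank layout pN (one rank per core
    in the pure-MPI single-node runs)."""
    return tuple(n * p for n, p in zip(NC, pn))

# Dibona: 64 cores, ranks arranged 2 x 2 x 16
assert node_problem_size((2, 2, 16)) == (512, 512, 512)
# Hazel Hen: 24 cores, ranks arranged 2 x 2 x 6
assert node_problem_size((2, 2, 6)) == (512, 512, 192)
```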

The code was run for 10 timesteps without any intermediate output. This proved sufficiently long for accurate time measurements.


Table 26: Summary of single-node experimental results for LBC.

                     Dibona                     Hazel Hen

Cores                64                         24
nN                   512 × 512 × 512            512 × 512 × 192
pN                   2 × 2 × 16                 2 × 2 × 6
Performance          (265.36 ± 3.04) MLUPS      (160.96 ± 1.27) MLUPS
Energy efficiency    (816 398 ± 1028) MLUP/J    (465 835 ± 637) MLUP/J

4.4.2 Node-to-node comparison on Dibona and Hazel Hen

In order to evaluate the computing capabilities and energy efficiency of the Arm platform, we compared the pure-MPI version of LBC on a single node of Dibona and Hazel Hen, respectively. We assume that the MPI communication within a single node has little impact on the performance of the code. Single-core runs were disregarded, as one core is not sufficient to saturate the available memory bandwidth on either system, and such performance measurements would overestimate and obfuscate the real application performance in production runs.

4.4.2.1 Performance

We have done 40 runs of LBC on each machine. In order to get a representative result, at most 5 runs were done within the same job, and jobs were distributed over 2 or more days.

The performance, in terms of the application-specific metric, is (265.36 ± 3.04) MLUPS per node for Dibona, and (160.96 ± 1.27) MLUPS per node for Hazel Hen. Note that the error is just above 1% in both cases. The node performance of Dibona is thus roughly 65% higher than that of Hazel Hen; at the same time, the number of cores per node is roughly 160% higher. This indicates that, proportionally, a core on Hazel Hen outperforms a core on Dibona.
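A quick sanity check of these ratios, using the per-node figures from Table 26 (illustrative arithmetic only):

```python
# Per-node performance (MLUPS) and core counts from Table 26.
dibona_mlups, dibona_cores = 265.36, 64
hazelhen_mlups, hazelhen_cores = 160.96, 24

node_gain = dibona_mlups / hazelhen_mlups - 1   # ~0.65, i.e. 65% higher
core_gain = dibona_cores / hazelhen_cores - 1   # ~1.67, i.e. roughly 160% more cores

# Per-core throughput: a Hazel Hen core delivers more MLUPS than a Dibona core.
per_core_dibona = dibona_mlups / dibona_cores       # ~4.15 MLUPS/core
per_core_hazelhen = hazelhen_mlups / hazelhen_cores  # ~6.71 MLUPS/core
```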

4.4.2.2 Energy consumption

We measured the energy to completion in Joules. LBC has a relatively long initialisation phase, which is disregarded in the application's own performance measurement. Unfortunately, we could only accurately read out the energy consumption counters at the beginning and the end of the application. In order to decrease the impact of the initialisation phase on the energy reading, we increased the number of timesteps from 10 to 500 for these measurements. We expected the initialisation phase to account for less than a few percent of the energy consumption, which has been confirmed by varying the number of timesteps. Note that the error reported below is the statistical standard deviation of the measurement sample and does not include this initialisation bias.

We have chosen to report energy efficiency in terms of lattice updates (i.e., work done) per consumed energy. For Dibona, the energy efficiency is (816 398 ± 1028) MLUP/J; for Hazel Hen it is (465 835 ± 637) MLUP/J. The statistical error is just above 1 percent. The results show that the energy efficiency in terms of the domain-specific metric MLUP/J is almost twice as high on Dibona as on Hazel Hen.

4.4.2.3 Discussion of observations

The performance and energy consumption measurements for single-node runs are summarised in Table 26.

The application-specific metric MLUPS is dominated by floating-point operations per time interval (FLOPS); essentially, these two quantities are proportional to each other. It is therefore justified to state that, comparing node to node, the performance of Dibona in terms of FLOPS


is 65% higher than on Hazel Hen. Since the number of cores on Dibona is much higher, the performance of Hazel Hen is higher when comparing core to core.

Using again the proportionality of MLUPS and FLOPS, the energy efficiency of Dibona in terms of FLOPS/J is almost twice that of Hazel Hen.

4.4.3 Comparison between pure-MPI and hybrid OmpSs/MPI

LBC uses domain decomposition for parallelisation across MPI ranks. As with most stencil codes, only the surface of the domains, sometimes referred to as ghost cells, needs to be exchanged between nearest neighbours. In addition, LBC uses a second-level domain decomposition to split MPI domains into smaller units which we refer to as tiles. Tiles are used for parallelisation with shared-memory programming models such as OpenMP and OmpSs. In a nutshell, there is a large loop over all tiles which dispatches the Lattice Boltzmann kernel for each tile. Once this loop completes, a routine is started to exchange ghost cells using MPI. This basic structure is common to the pure-MPI version and the hybrid MPI/OmpSs versions introduced below.

The hybrid MPI/OmpSs versions essentially just add appropriate OmpSs annotations. All versions encapsulate all MPI communication in a single OmpSs task referred to as exchange. Note that there are multiple MPI calls as well as local data movement in this task. Computation on tiles is also encapsulated in tasks. For simplicity, we will refer to tasks which compute tiles on the surface of the MPI domain as outer; conversely, tasks which compute tiles that are not part of the surface are referred to as inner. Note that this distinction is not made explicitly in the code; it arises from the different dependencies on the MPI task exchange, as explained below. There is only a single exchange task, but many inner and outer tasks.

We created three versions which expose increasing potential to overlap communication and computation.

fork-join The dependencies between tasks are set up in such a way that the exchange task depends on all outer tasks as well as on all inner tasks. Basically, this mimics the behaviour of an OpenMP loop-parallel version without task dependencies; it is also functionally equivalent to a pure-MPI version. The only difference to the MPI version is the number of tiles, which is proportional to the number of cores (and is one for MPI). There is no potential for overlapping communication with computation in this version.

comm-hiding This version leverages the fact that only data on the surface of the domain needs to be exchanged with neighbours via MPI. All of the surface is fully calculated by tasks of type outer. Task dependencies are therefore set up in such a way that the exchange task depends only on the outer tasks and has no dependencies on the inner tasks. The OmpSs scheduler can in principle start the execution of exchange as soon as all outer tasks complete, and concurrently execute some of the inner tasks, thus overlapping communication with computation. If the communication task is shorter than the duration of all inner tasks, the communication can be fully hidden.

comm-thread This version uses the same dependency setup as comm-hiding. However, the exchange task is declared with the clause comm_thread. Tasks declared with comm_thread are not executed by a regular OmpSs worker thread, but by a dedicated communication thread. Regular worker threads continue executing other tasks.
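The dependency structure of the comm-hiding variant can be sketched in plain Python with futures (purely illustrative: the real code expresses these dependencies with OmpSs task clauses in Fortran, not with threads):

```python
from concurrent.futures import ThreadPoolExecutor

def step(tiles, is_outer, compute, exchange):
    """One timestep with the comm-hiding schedule: the exchange task
    waits only on the outer tiles, so the inner tiles can run
    concurrently with the ghost-cell exchange."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        outer = [pool.submit(compute, t) for t in tiles if is_outer(t)]
        inner = [pool.submit(compute, t) for t in tiles if not is_outer(t)]
        for f in outer:
            f.result()                  # exchange depends only on outer tasks
        pool.submit(exchange).result()  # runs while inner tiles may still compute
        for f in inner:
            f.result()
```

In the fork-join variant, by contrast, the exchange would also wait on all inner futures, leaving no overlap.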

The communication-thread approach has been developed within MB3 as part of work package WP7 as a method to increase the overlap of communication and computation. It should


Performance [MLUPS/node]
nodes            1              4              16

pure-MPI         161.0 ± 1.3    152.4 ± 2.5    138.3 ± 1.3
fork-join        157.1 ± 1.0    152.3 ± 1.7    147.1 ± 1.2
comm-hiding      162.0 ± 0.7    159.7 ± 1.0    155.2 ± 1.2
comm-thread      158.7 ± 1.0    152.3 ± 1.2    148.3 ± 1.1

Table 27: Summary of the performance of LBC in weak scaling experiments on Hazel Hen.

Performance [MLUPS/node]
nodes            1              4              16

pure-MPI         265.4 ± 3.0    262.8 ± 0.5    214.5 ± 8.5
fork-join        255.7 ± 1.4    252.2 ± 0.9    248.3 ± 0.8
comm-hiding      273.2 ± 1.0    273.2 ± 0.4    269.8 ± 0.8
comm-thread      273.1 ± 1.3    273.0 ± 0.8    132.9 ± 13.8

Table 28: Summary of the performance of LBC in weak scaling experiments on Dibona.

benefit from the fact that many MPI implementations are capable (to varying degrees) of offloading MPI communication to the network interface hardware with only little effort by the CPU. Executing such tasks on an additional thread allows regular worker threads to continue calculations with only little disturbance from the background MPI communication.

Initial experiments on Dibona showed that the communication thread was susceptible to thread starvation when oversubscribing the number of cores (see D7.13). All experiments on Dibona (with 16 cores available per MPI process) have thus been conducted with 15 OmpSs worker threads plus 1 communication thread, rather than 16 worker threads plus 1 communication thread. Note that on Hazel Hen oversubscription of cores was not an issue.

We have done only weak scaling experiments, keeping the problem size per core, nc, the same as in the single-node runs presented above (see Table 26). All MPI/OmpSs versions use 4 MPI ranks per node with a layout of pN = (2 × 2 × 1). The remaining cores on a node are used as OmpSs worker threads. For the MPI version, the third dimension is used as given in Table 26. In both cases, i.e. pure-MPI and hybrid MPI/OmpSs, parallelisation across nodes is done along the third dimension.

For all experiments, we have used tiles of size (32 × 32 × 32). Given the dimensions per core as reported above in Table 26, we have nominally 64 tiles per core (i.e., worker thread).

4.4.3.1 Performance

We performed at least 20 runs of LBC on each machine. In order to get a representative result, at most 10 runs were done within the same job, and jobs were distributed over 2 days in most cases. Again, we use the application-specific metric MLUPS as a proxy for performance. Error bars are plotted on all data points in the subsequent figures. Unless stated otherwise, the statistical deviation of measurements is lower than 2%, and in many cases even lower than 1%.

The results of a weak scaling experiment on Hazel Hen are shown in Table 27 and illustrated in Figure 68. On one node, the performance of all four versions of the code is very similar; in fact, they are in a statistical sense not significantly different. However, the versions pure-MPI and comm-hiding tend to be slightly faster than the versions fork-join and comm-thread. Overall, the difference between the best and the worst version is just around 3%.

When increasing the number of nodes from 1 to 4 and 16, the scaling behaviour is roughly similar for the four versions, but less efficient for the pure-MPI version. At 16 nodes, the



Figure 68: Weak scaling of LBC on Hazel Hen for different parallelisation approaches. In addition to a pure-MPI version, the figure shows three hybrid OmpSs/MPI versions, namely fork-join, comm-hiding, and comm-thread. The performance is recorded in the application-specific metric MLUPS per node. Higher values correspond to better performance.


Figure 69: Weak scaling of LBC on Dibona for different parallelisation approaches. In addition to a pure-MPI version, the figure shows three hybrid OmpSs/MPI versions, namely fork-join, comm-hiding, and comm-thread. The performance is recorded in the application-specific metric MLUPS per node. Higher values correspond to better performance.


pure-MPI version is significantly slower than the comm-hiding version. The performance of fork-join and comm-thread cannot be statistically distinguished, but lies between that of the other two versions. The scaling efficiency for all hybrid MPI/OmpSs versions from 1 to 16 nodes is excellent, at above 93%, and lower, at 86%, for the MPI version.
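Weak-scaling efficiency here is the per-node performance retained when going from 1 to 16 nodes; a small check against the values of Table 27:

```python
# Per-node MLUPS from Table 27 (Hazel Hen) at 1 and 16 nodes.
perf_1n  = {"pure-MPI": 161.0, "fork-join": 157.1,
            "comm-hiding": 162.0, "comm-thread": 158.7}
perf_16n = {"pure-MPI": 138.3, "fork-join": 147.1,
            "comm-hiding": 155.2, "comm-thread": 148.3}

def weak_scaling_efficiency(version):
    """Fraction of single-node per-node performance kept at 16 nodes."""
    return perf_16n[version] / perf_1n[version]

# Hybrid OmpSs versions stay above 93%; pure-MPI drops to ~86%.
```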

The results of a weak scaling experiment on Dibona are shown in Table 28 and illustrated in Figure 69. On one node, the performance of comm-hiding and comm-thread is identical and higher than that of any other version. The fork-join version is slower with statistical significance. The performance of the MPI version is between those two groups. However, the difference between the best and the worst version is just above 6%.

Scaling from 1 to 4 nodes is excellent for all versions, with no statistically significant change in any version's performance. The versions comm-hiding and fork-join continue to scale out to 16 nodes. The scaling efficiency from 1 to 16 nodes is above 97% for these two versions. The performance of the versions pure-MPI and comm-thread, however, drops significantly at 16 nodes. At the same time, the variation between runs increases, which leads to unusually large error bars of 4% and 10%, respectively.

The behaviour of the various versions at 16 nodes is counter-intuitive and unexplained to date. In particular, pure-MPI and fork-join both do MPI communication after completing all computations and transfer the same amount of data across nodes; they should be equally affected by any network or MPI issue.

4.4.3.2 Energy efficiency

We measured the energy efficiency for a few samples. As expected, any change in energy efficiency is fully explained, within the given error limits, by changes in performance. We have found no statistically significant effect due to the different parallelisation approaches.

4.4.3.3 Discussion of observations

First of all, it is worth noting that the problem size per node is larger on Dibona than on Hazel Hen, but the amount of data moved between nodes is the same. Thus, the MPI communication time is relatively more important on Hazel Hen than on Dibona. Secondly, communication can be overlapped only with computation of the inner tiles, which do not participate in MPI communication. However, the ratio of the number of inner tiles to the total number of tiles is 0.49 for Dibona and only 0.38 for Hazel Hen. Thus, it is more challenging to completely hide communication on Hazel Hen than on Dibona. Together, this might explain why scaling on Dibona is almost perfect (with the exception discussed below).
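These ratios can be reproduced from the experimental setup, assuming each MPI rank holds a quarter of the node domain (layout pN = 2 × 2 × 1) cut into 32³ tiles, and counting as inner those tiles that touch no face of the rank's domain (a sketch under these assumptions):

```python
TILE = 32  # tiles are 32 x 32 x 32 lattice elements

def inner_tile_ratio(rank_domain):
    """Fraction of tiles with no face on the MPI-domain surface."""
    tiles = [d // TILE for d in rank_domain]  # tiles per dimension
    total = tiles[0] * tiles[1] * tiles[2]
    inner = max(tiles[0] - 2, 0) * max(tiles[1] - 2, 0) * max(tiles[2] - 2, 0)
    return inner / total

# Dibona: node domain 512x512x512 over 2x2x1 ranks -> 256x256x512 per rank
r_dibona = inner_tile_ratio((256, 256, 512))    # ~0.49
# Hazel Hen: node domain 512x512x192 over 2x2x1 ranks -> 256x256x192 per rank
r_hazelhen = inner_tile_ratio((256, 256, 192))  # ~0.38
```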

On both systems, however, communication overlap using OmpSs tasks works in principle. In particular, the comm-hiding strategy is always better than pure-MPI. Contrary to our expectations, the comm-thread version, which conceptually has the largest potential for overlap, did not work better than comm-hiding.

The results on Dibona at larger scales are inconclusive. In particular, it remains unclear why the performance of two versions drops so significantly; in fact, it is not even clear what these two versions have in common.

The absolute performance of the hybrid OmpSs versions is comparable to the MPI-only version, showing that OmpSs overheads are small for the given task granularity.

4.4.4 Conclusions

The purpose of using LBC was twofold: comparing the absolute performance and scaling of the Arm-based Dibona platform to a more traditional Cray XC40, and evaluating OmpSs as a means to overlap MPI communication with computation.


The absolute performance, comparing node to node, is higher on Dibona, but by less than the increase in core count would suggest. On the other hand, the energy efficiency of Dibona is clearly much higher than that of the older Cray XC40 system.

Comparing the hybrid OmpSs versions to the MPI-only version shows that taskification is a powerful tool. On both systems, the OmpSs versions manage to overlap communication with computation, as expected.

Unexpectedly, the approach of offloading communication to a dedicated communication thread did not outperform the regular communication overlap approaches. In addition, the experiments on Dibona showed that the communication thread is susceptible to thread starvation. Further research into the impact of operating system thread scheduling parameters is necessary.

Lastly, the experiments were done at a very small number of nodes, at least in an HPC context. It is therefore impossible to extrapolate the networking performance and scaling behaviour of the LBC application to Exascale systems.


5 Conclusions

The WP6, aims at providing a set of kernels, mini-apps, and applications for Arm-based plat-

forms evaluation, co-design, and system software assessment. Partners involved in WP6 worked

on the applications of their interest, focusing in the last period of the project on the evalua-

tion of the project test-platform, Dibona. This document recollects all Dibona tests performed

during the last year of the project. We report the results of a comprehensive benchmarking

campaign using a bottom-up approach, from simple tests to complex multi-node executions.

Micro-benchmarking of the test-platform architecture – has been reported in Section 2. Us-

ing micro-kernels, we measured the three most critical subsystem of each HPC system: the

memory bandwidth and latency, the floating-point throughput and the network bandwidth and

latency. Results show that the eight DDR4 memory channels of the Cavium ThunderX-2 SoC

deliver 218 GB/s with the STREAM benchmark, reaching the 64% of the nominal available

bandwidth. Since the core implements a NEON SIMD unit with a vector width of 128 bits, the

pure floating point performance is lower than the one of cores with wider SIMD units (e.g., ∼ 4×lower than a Skylake Platinum CPU. Finally, we evaluated the network subsystem, reporting a

bandwidth of ∼ 12 GB/s, equivalent to 95% of the peak with packages larger than 32 kB. We

also performed sanity checks of the network links and evaluation of the collective operations

made available by different MPI implementations. This evaluation has been performed by BSC

and helped ATOS/Bull in spotting sub-optimal system software configurations.

Results of “classical” HPC benchmarks and proxy-apps – are presented in Section 3. We

reported the evaluation of benchmarks and mini-applications well known and acknowledged by

the HPC community. Aiming at maximizing the efficiency of the benchmarks, we focused on

finding reasonable parameters, compiler flags and libraries reaching e.g., HPL Rpeak ∼ 80%

with 1000 cores. An added value of this section is that our results that can be easily compared

with the ones of other reference systems (e.g., Top500 and Green500). BSC took care of this

part.

Since Mont-Blanc 3 advocated the MPI+X parallelization model, we also include in this section the description and evaluation on Dibona of the OpenMP/OmpSs implementations of HPCG and Lulesh developed within the Mont-Blanc 3 project, corroborating the idea that a clever runtime handling the parallelism within the compute nodes produces performance benefits without harming scalability. BSC was responsible for these tasks.

As several scientific codes leverage external solvers in their implementations, we evaluated the Eikonal and Jacobi solvers on Dibona. Also, due to the increasing relevance of accelerators in modern HPC systems, we implemented and evaluated a heterogeneous MPI+CUDA version of the Eikonal solver. Even if this version could not be tested on Arm-based clusters due to the lack of support for Arm by NVIDIA, this work is still relevant for the project. This work was performed by the University of Graz. A third solver, by AVL, was implemented in the form of a proxy-app and evaluated on Dibona, showing a node-to-node performance about two times slower than that of a state-of-the-art Xeon Skylake.

Porting and testing of production applications – Since Dibona has been officially announced as an Arm-based HPC solution and is sold by ATOS/Bull as a production system17, we report in Section 4 the scalability results of four production codes: Alya, a finite element framework by BSC; OpenIFS, a weather forecast program by the European Centre for Medium-Range Weather Forecasts; LBC, a lattice Boltzmann code for fluid dynamics simulations; and TensorFlow, one of the most popular frameworks for machine learning. For all these applications we reported figures of performance and energy as well as scalability, showing consistent performance on Dibona. For TensorFlow, we also implemented a new version of its operations leveraging the Arm Performance Libraries for the most used linear algebra kernels. Results show speedups between 2× and 5× on Dibona and overall performance figures comparable with state-of-the-art HPC architectures based, e.g., on Intel and IBM Power9 processors. All production applications were evaluated by BSC, except for LBC, whose evaluation was performed by the University of Stuttgart.

One of the missions of WP6 was to demonstrate the paramount importance of advanced runtime techniques for efficiently porting large production codes to new architectures. The idea is to demonstrate how runtimes, deployed at the level of system software, require minimal or even no changes to the source code, boosting performance without harming portability or the semantics of the source code. In the longer term, we believe tools like multidependences and DLB, successfully evaluated e.g. with Alya in this project, will allow programmers to survive the waves of architectural novelties without drowning in fine-tuned optimizations of the code. A proof of the importance of these ideas is the acceptance of multidependences into the OpenMP 5.0 standard. Once more, the Mont-Blanc project contributed to the HPC community by providing ideas, evaluating them, and demonstrating their potential.

Since Arm-based systems are “de facto” a reality (see, e.g., the case of Astra18), we supported all our studies and evaluations in this document with performance comparisons with other Tier-0 and Tier-1 HPC clusters featuring different systems/architectures (mostly Intel/Cray, Intel/Lenovo, and IBM Power9). Also, when relevant, we studied figures of power and energy. We are far from the days of Mont-Blanc 1, when we were using mobile SoCs; however, overall we observed an excellent energy proportionality compared to other HPC solutions.

As a last comment, we want to highlight the huge effort spent in the bring-up and deployment phase by BSC (WP6) and ATOS/Bull (WP3). Even if not strictly related to the application evaluation, we considered it important to report in Section 1 the Dibona configuration and a summary of the several issues that were isolated and solved during the early days of deployment of Dibona.

17 https://atos.net/en/2018/press-release_2018_11_08/cea-acquires-bullsequana-supercomputer-atos-equipped-marvell-thunderx2-arm-based-processors-2
18 https://www.top500.org/news/sandia-to-install-first-petascale-supercomputer-powered-by-arm-processors/

