mixed mode programming on a clustered smp...

�

��

��

��

Mixed Mode Programming on a Clustered SMP System

Jake M. Duthie

September 12, 2003

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2003

Authorship declaration

I, Jake Martin Duthie, confirm that this dissertation and the work presented in it are my own achieve-ment.

1. Where I have consulted the published work of others this is always clearly attributed;

2. Where I have quoted from the work of others the source is always given. With the exception ofsuch quotations this dissertation is entirely my own work;

3. I have acknowledged all main sources of help;

4. If my research follows on from previous work or is part of a larger collaborative research projectI have made clear exactly what was done by others and what I have contributed myself;

5. I have read and understand the penalties associated with plagiarism.

Signed:

Date: September 12, 2003

Matriculation no: 9722273

Abstract

Clustered SMP Systems are becoming the architecture of choice for supercomputers in the HPC indus-try, and hence the question of how best to program for such machines is becoming more important.This project seeks to analyse one such programming style, the Mixed Mode model, which uses bothMPI and OpenMP in a single source to take advantage of the underlying machine configuration. Theprimary point of comparison will be with a Pure MPI version of the parallel implementation, whichis the programming norm for such systems. Four codes were used in the testing process: a programwritten specifically for this project based on a standard iterative algorithm; and three codes taken froman existing benchmark suite. In addition to a comparison of the execution times, hardware counters andother system tools were used where appropriate in order to develop a complete understanding of theperformance characteristics. The system used to gather all of the data for this project was an IBM p690Cluster.

In general, Mixed Mode was found to be a less efficient programming choice than Pure MPI, with theOpenMP threads encountering problems with both computational scalability and in making effectiveuse of the communication library. However, one Mixed code from the benchmark suite was able toobtain a performance improvement of 35% over its MPI version, because it employed overlapped com-munication/computation functionality, and was also able to replace explicit communications with directreads/writes to memory.

Contents

1 Introduction 1

2 Background 32.1 Clustered SMP Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Mixed Mode Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 The Jacobi Code 83.1 Algorithm Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.2 Code Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2.1 Serial Code - The Computation . . . . . . . . . . . . . . . . . . . . . . . . . 103.2.2 Parallel Code - The Communication . . . . . . . . . . . . . . . . . . . . . . . 11

3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3.1 Hardware: The HPCx Service . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3.3 Experimental Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.4.1 Fixed Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.4.2 Scaling Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 ASCI Purple Benchmarks 554.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554.2 Codes Employed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2.1 SMG2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2.2 UMT2K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2.3 sPPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.3.1 Hardware and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.3.2 Experimental Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.4.1 SMG2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.4.2 UMT2K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.4.3 sPPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5 Conclusions 775.1 Project Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775.2 Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

i

5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

A Tabulated Data 81A.1 The Jacobi Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

A.1.1 Fixed Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81A.1.2 L3 Scaling Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85A.1.3 L2 Scaling Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

A.2 ASCI Purple Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92A.2.1 UMT2K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92A.2.2 sPPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

ii

List of Tables

3.1 HPCx cache design and hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.2 Tested OpenMP code versions and their features . . . . . . . . . . . . . . . . . . . . . 323.3 HPM Cache/Memory data obtained from the simple array-addition code, for varying

total problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.1 Initial SMG2000 run, performed on 4 LPARs with the same global problem size. Lefthand table shows Pure MPI times, and right hand shows Mixed. . . . . . . . . . . . . 60

4.2 Time spent in the OpenMP loop per iteration per process. Left table is for MPI, andright is for Mixed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

A.1 Fixed Problem; OpenMP; 1 LPAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81A.2 Fixed Problem; MPI; 1 LPAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81A.3 Fixed Problem; Mixed; 4 LPARs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82A.4 Fixed Problem; MPI; 4 LPARs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82A.5 Fixed Problem; Mixed; 8 LPARs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82A.6 Fixed Problem; MPI; 8 LPARs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82A.7 Fixed Problem; Mixed with 1 thread per process; 4 LPARs . . . . . . . . . . . . . . . 82A.8 Fixed Problem; OpenMP version 1; 1 LPAR . . . . . . . . . . . . . . . . . . . . . . . 83A.9 Fixed Problem; OpenMP version 2; 1 LPAR . . . . . . . . . . . . . . . . . . . . . . . 83A.10 Fixed Problem; OpenMP version 4; 1 LPAR . . . . . . . . . . . . . . . . . . . . . . . 83A.11 Fixed Problem; OpenMP version 5; 1 LPAR . . . . . . . . . . . . . . . . . . . . . . . 83A.12 Fixed Problem; Mixed version 2; 4 LPARs . . . . . . . . . . . . . . . . . . . . . . . . 84A.13 Fixed Problem; Mixed version 1; 4 LPARs . . . . . . . . . . . . . . . . . . . . . . . . 84A.14 Fixed Problem; Mixed version 2 with 1 thread per process; 4 LPARs . . . . . . . . . . 84A.15 L3 Cache fit with Collectives on; OpenMP; 1 LPAR . . . . . . . . . . . . . . . . . . . 85A.16 L3 Cache fit with Collectives off; OpenMP; 1 LPAR . . . . . . . . . . . . . . . . . . 85A.17 L3 Cache fit with Collectives on; Mixed; 1 LPAR . . . . . . . . . . . . . . . . . . . . 85A.18 L3 Cache fit with Collectives off; Mixed; 1 LPAR . . . . . . . . . . . . . . . . . . . . 85A.19 L3 Cache fit with Collectives on; MPI; 1 LPAR . . . . . . . . . . . . . . . . . . . . . 85A.20 L3 Cache fit with Collectives off; MPI; 1LPAR . . . . . . . . . . . . . . . . . . . . . 85A.21 L3 Cache fit with Collectives on; Mixed; 2 LPARs . . . . . . . . . . . . . . . . . . . . 86A.22 L3 Cache fit with Collectives off; Mixed; 2 LPARs . . . . . . . . . . . . . . . . . . . 86A.23 L3 Cache fit with Collectives on; MPI; 2 LPARs . . . . . . . . . . . . . . . . . . . . . 86A.24 L3 Cache fit with Collectives off; MPI; 2 LPARs . . . . . . . . . . . . . . . . . . . . 86A.25 L3 Cache fit with Collectives on; Mixed; 4 LPARs . . . . . . . . . . . . . . . . . . . . 86A.26 L3 Cache fit with Collectives off; Mixed; 4 LPARs . . . . . . . . . . . . . . . . . . . 86A.27 L3 Cache fit with Collectives on; MPI; 4 LPARs . . . . . . . . . . . . . . . . . . . . . 87

iii

A.28 L3 Cache fit with Collectives off; MPI; 4 LPARs . . . . . . . . . . . . . . . . . . . . 87A.29 L3 Cache fit with Collectives on; Mixed; 8 LPARs . . . . . . . . . . . . . . . . . . . . 87A.30 L3 Cache fit with Collectives off; Mixed; 8 LPARs . . . . . . . . . . . . . . . . . . . 87A.31 L3 Cache fit with Collectives on; MPI; 8 LPARs . . . . . . . . . . . . . . . . . . . . . 87A.32 L3 Cache fit with Collectives off; MPI; 8 LPARs . . . . . . . . . . . . . . . . . . . . 88A.33 L3 Cache fit with Collectives on; Mixed; 16 LPARs . . . . . . . . . . . . . . . . . . . 88A.34 L3 Cache fit with Collectives off; Mixed; 16 LPARs . . . . . . . . . . . . . . . . . . . 88A.35 L3 Cache fit with Collectives on; MPI; 16 LPARs . . . . . . . . . . . . . . . . . . . . 88A.36 L3 Cache fit with Collectives off; MPI; 16 LPARs . . . . . . . . . . . . . . . . . . . . 88A.37 L3 Cache fit with Collectives on; Mixed on 4 LPARs with 2 threads . . . . . . . . . . 88A.38 L3 Cache fit with Collectives on; Mixed on 4 LPARs with 4 threads . . . . . . . . . . 89A.39 L2 Cache fit with Collectives on; OpenMP; 1 LPAR . . . . . . . . . . . . . . . . . . . 89A.40 L2 Cache fit with Collectives off; OpenMP; 1 LPAR . . . . . . . . . . . . . . . . . . 89A.41 L2 Cache fit with Collectives on; Mixed; 1 LPAR . . . . . . . . . . . . . . . . . . . . 89A.42 L2 Cache fit with Collectives off; Mixed; 1 LPAR . . . . . . . . . . . . . . . . . . . . 89A.43 L2 Cache fit with Collectives on; MPI; 1 LPAR . . . . . . . . . . . . . . . . . . . . . 89A.44 L2 Cache fit with Collectives off; MPI; 1 LPAR . . . . . . . . . . . . . . . . . . . . . 89A.45 L2 Cache fit with Collectives on; Mixed; 2 LPARs . . . . . . . . . . . . . . . . . . . . 90A.46 L2 Cache fit with Collectives off; Mixed; 2 LPARs . . . . . . . . . . . . . . . . . . . 90A.47 L2 Cache fit with Collectives on; MPI; 2 LPARs . . . . . . . . . . . . . . . . . . . . . 90A.48 L2 Cache fit with Collectives off; MPI; 2 LPARs . . . . . . . . . . . . . . . . . . . . 90A.49 L2 Cache fit with Collectives on; Mixed; 4 LPARs . . . . . . . . . . . . . . . . . . . . 90A.50 L2 Cache fit with Collectives off; Mixed; 4 LPARs . . . . . . . . . . . . . . . . . . . 90A.51 L2 Cache fit with Collectives on; MPI; 4 LPARs . . . . . . . . . . . . . . . . . . . . . 91A.52 L2 Cache fit with Collectives off; MPI; 4 LPARs . . . . . . . . . . . . . . . . . . . . 91A.53 L2 Cache fit with Collectives on; Mixed; 8 LPARs . . . . . . . . . . . . . . . . . . . . 91A.54 L2 Cache fit with Collectives off; Mixed; 8 LPARs . . . . . . . . . . . . . . . . . . . 91A.55 L2 Cache fit with Collectives on; MPI; 8 LPARs . . . . . . . . . . . . . . . . . . . . . 91A.56 L2 Cache fit with Collectives off; MPI; 8 LPARs . . . . . . . . . . . . . . . . . . . . 92A.57 Initial 4 LPAR run for MPI and Mixed . . . . . . . . . . . . . . . . . . . . . . . . . . 92A.58 1 LPAR run for OpenMP, Mixed and MPI . . . . . . . . . . . . . . . . . . . . . . . . 92A.59 1 LPAR run for MPI with reduced tmax . . . . . . . . . . . . . . . . . . . . . . . . . 92A.60 1 LPAR run for Mixed with reduced tmax . . . . . . . . . . . . . . . . . . . . . . . . 92A.61 1 LPAR run for OpenMP with reduced tmax . . . . . . . . . . . . . . . . . . . . . . 93A.62 Final run on 2, 4, and 8 LPARs for MPI and Mixed . . . . . . . . . . . . . . . . . . . 93A.63 Mixed Mode run on 4 LPARs with varying process decompositions; 1 process per LPAR

and 8 threads per process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93A.64 Pure MPI run on 4 LPARs with varying process decompositions; 8 processes per LPAR 93A.65 1 LPAR run for OpenMP (8 threads), Mixed (1 process, 8 threads per process), and MPI

(8 processes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93A.66 Mixed Mode run on 1, 2, 4, and 8 LPARs, with 1 process per LPAR and 8 threads per

process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94A.67 Pure MPI run on 1, 2, 4, and 8 LPARs, with 8 processes per LPAR . . . . . . . . . . . 94

iv

List of Figures

2.1 Representation of a Clustered SMP System . . . . . . . . . . . . . . . . . . . . . . . 42.2 Representation of a typical Mixed Mode program’s flow of control through MPI and

OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1 Schematic kernel design for the Jacobi code . . . . . . . . . . . . . . . . . . . . . . . 103.2 Representation of regular data decomposition and communication in 2D . . . . . . . . 123.3 Schematic kernel design for the Jacobi code with MPI routines added for communication 133.4 The three decompositions available with the OpenMP codes. The left hand panel de-

scribes i-decomposition with versions 1 and 2. The centre panel shows j-decompositionwith version 3. The right hand panel shows one possible 2D decomposition using ver-sions 4 and 5, although other methods including 1D decompositions would be perfectlypossible. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.5 Schematic representation of the MPI/OpenMP hierarchy in the Mixed Mode design . . 183.6 Timer data for Pure OpenMP and MPI runs performed on 1 LPAR. Horizontal axis

displays the given process or thread layout in 2D . . . . . . . . . . . . . . . . . . . . 243.7 Timer data for Mixed (top) and MPI (bottom) runs performed on 4 LPARs. Horizontal

axis displays the given MPI process layout in 2D . . . . . . . . . . . . . . . . . . . . 273.8 Timer data for Mixed (top) and MPI (bottom) runs performed on 8 LPARs. Horizontal

axis displays the given MPI process layout in 2D . . . . . . . . . . . . . . . . . . . . 283.9 Representation of the process-topology for a 1D problem. Dashed blocks indicate LPAR

boundaries; solid squares are individual processes. . . . . . . . . . . . . . . . . . . . . 293.10 Representation of the process-topology for a 2D problem. Dashed blocks indicate LPAR

boundaries; solid squares are individual processes. . . . . . . . . . . . . . . . . . . . . 293.11 Timer data for Mixed runs performed on 4 LPARs with 1 thread per process. Horizontal

axis displays the given process layout in 2D . . . . . . . . . . . . . . . . . . . . . . . 313.12 Timer data for Pure OpenMP version 1 (top) and 2 (bottom) runs. Horizontal axis

displays the number of threads used . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.13 Timer data for Pure OpenMP version 4 (top) and 5 (bottom) runs. Horizontal axis

displays the given thread layout in 2D . . . . . . . . . . . . . . . . . . . . . . . . . . 343.14 Timer data for Version 2 Mixed (top) and Version 1 Mixed (bottom) runs performed on

4 LPARs. Horizontal axis displays the given process layout in 2D. . . . . . . . . . . . 363.15 Timer data for a Version 2 Mixed run performed on 4 LPARs with 1 thread per process.

Horizontal axis displays the given process layout in 2D. . . . . . . . . . . . . . . . . . 373.16 Timer data for OpenMP for the L3 Scaling Problem Size on 1 LPAR. Horizontal axis

displays the number of threads used. . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.17 Timer data for Mixed (1 process; 8 threads) for the L3 Scaling Problem Size on 1 LPAR.

Horizontal axis displays the given MPI process layout in 2D. . . . . . . . . . . . . . . 40

v

3.18 Timer data for MPI for the L3 Scaling Problem Size on 1 LPAR. Horizontal axis displaysthe given MPI process layout in 2D. . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.19 Timer data for Mixed (top) and MPI (bottom) runs on 2 LPARs. Horizontal axis displaysthe given MPI process layout in 2D. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43



3.22 Timer data for Mixed (top) and MPI (bottom) runs on 16 LPARs. Horizontal axis dis-plays the given MPI process layout in 2D. . . . . . . . . . . . . . . . . . . . . . . . . 47

3.23 Timer data for differing process/thread combinations for the Mixed code for 4 LPARruns, and a corresponding Pure MPI run. Horizontal axis displays the given MPI processlayout in 2D, along with the number of threads per process used if referring to a Mixedrun. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.24 Line graphs showing the total number of Main Memory Loads recorded on each MPIprocess for Mixed (1p x 8t) (top left), Mixed (2p x 4t) (top right), Mixed (4p x 2t)(bottom left), and MPI (bottom right) on 4 LPARs. . . . . . . . . . . . . . . . . . . . 51

3.25 Representation of the relationship between thread data locality and the MPI communi-cation pattern for a Mixed code. The shaded area indicates the data halos on the process. 52

4.1 Results of a 4 LPAR run for UMT2K with MPI (left) and Mixed (right) versions. . . . 624.2 Results of a 1 LPAR run for UMT2K with OpenMP (left), Mixed (centre), and MPI (right). 634.3 Results of a 1 LPAR run for UMT2K with MPI for 1, 2, 4, and 8 processors with one

process per processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.4 Results of a 1 LPAR run for UMT2K with Mixed for 1, 2, 4, and 8 processors with one

process per LPAR and 1 thread per processor. . . . . . . . . . . . . . . . . . . . . . . 654.5 Results of a 1 LPAR run for UMT2K with OpenMP for 1, 2, 4, and 8 processors with

one thread per processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.6 Results of a 2, 4, and 8 LPAR runs for UMT2K with Mixed (one process per LPAR and

8 threads per process) and MPI (8 processes per LPAR) versions. . . . . . . . . . . . . 674.7 Results of a 4 LPAR run for sPPM with Mixed (top – one process per LPAR and 8

threads per process) and MPI (bottom – 8 processes per LPAR) versions. The differentMPI process decompositions used are displayed on the x-axis, grouped with 1D decom-positions on the left and 2D/3D on the right. . . . . . . . . . . . . . . . . . . . . . . . 69

4.8 Process topologies created for a 2×1×1 decomposition (top left), a 2×2×1 decompo-sition (top right), and a 2× 2× 2 decomposition (bottom) in sPPM. Numbers shown oneach individual cube indicate the rank of each process; note that rank 4 in the 2× 2 × 2cube is obscured. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.9 Results of a 1 LPAR run for sPPM with OpenMP, Mixed and MPI versions. All runswere performed on 8 processors: OpenMP used 8 threads; Mixed used a process de-composition of 1 × 1 × 1 and 8 threads; and MPI used a process decomposition of2 × 2 × 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.10 Results of 1, 2, 4, and 8 LPAR runs for sPPM with Mixed (top – one process per LPARand 8 threads per process) and MPI (bottom – 8 processes per LPAR) versions. Thedifferent MPI process decompositions used are displayed on the x-axis. . . . . . . . . 75

vi

Acknowledgements

I would like to thank my project supervisors, Mark Bull and Lorna Smith, for their patient and endlesssupport throughout this project.

Chapter 1

Introduction

The most recent machine architecture trend in the High Performance Computing industry has beentowards Clustered SMP Systems. These are distributed memory systems, but with each node comprisingof a traditional shared memory multiprocessor (SMP) as opposed to a single processor. Such systemsare currently dominating the market due mainly to economic factors that facilitate their development;this can be witnessed by the fact that most of the machines in the Top 500 List [20] of the fastestsupercomputers in the world are presently Clustered SMP systems.

These machines pose an interesting question to software developers: how should such a system be pro-grammed? One standard approach, the Shared Variable Model, cannot in general be applied since thesemachines typically lack the necessary single address space needed for such a programming concept.This leaves the Message Passing Model, which can indeed be used across an entire Clustered SMP –it can treat the machine as a pure distributed system by ignoring the existence of the on-node sharedmemory. However, it is not clear whether this is the most efficient use of the underlying architecture.

An alternative to the Message Passing Model is the Mixed Mode Model, which uses Shared Variable pro-gramming for intra-node communications and Message Passing for inter-node communications. Whilstthis is more representative of the actual machine architecture, as it makes explicit use of the single ad-dress space available on each SMP node, it is not necessarily the case that such a programming modelgives the best performance on a Clustered SMP. In addition to the extra code development time neces-sary to produce a working Mixed Mode code (since Mixed Mode is in most cases harder to develop andmaintain than other models), it can also be quite difficult to produce a code that actually makes efficientuse of the system in question. Hardware and software features of the Cluster also come into play whenconsidering the performance of these codes, and these can influence the efficiency in unexpected ways.

This project seeks to investigate Mixed Mode programming on a Clustered SMP system (the HPCx Ser-vice [21], an IBM p690 cluster), focusing particularly on performance. The primary point of comparisonwill be with a pure Message Passing code, as these form the de facto standard for parallel programmingon most supercomputers; however, a limited comparison with a pure Shared Variable code will also beavailable inside a single node.

To this end, the project will compare Mixed Mode performance against other models in two separatecomponents. The first will involve an extensive analysis of a benchmark written by the author, based onan existing Message Passing code. The behaviour of the code’s computation and communication on the

1

HPCx service will be investigated with the intention of optimising the Mixed Mode code to its fullestextent, and subsequent to this, an in-depth study of the performance characteristics will be carried out.

The second component will entail using a suite of established benchmarks in both Mixed Mode andMessage Passing models, again looking at their comparative performance characteristics. This studywill be conducted at a higher level than for the benchmark written specifically for this project; giventhe increased level of complexity inherent in the supplied code, it will not be possible to gain an under-standing of every component of the benchmarks in the time available.

Performance analysis will consist of a straightforward runtime comparison between the different parallelmodels, both in terms of the overall execution time and also the time spent in particular sections of thebenchmark kernels where possible. In addition, hardware counters and communication tracer tools willbe employed where appropriate, in order to gain a fuller understanding of the underlying behaviour ofthe codes at the execution level. Utilising all of these together, it is hoped that a comprehensive study ofMixed Mode performance can be constructed.

The subject matter of this project is presented as follows. Chapter 2 provides an overview of the theorybehind Mixed Mode programming, clarifying precisely what the term means in the context of existingparallel models and providing an examination of some of the expected advantages and disadvantages asregards performance. A brief examination of Clustered SMP systems is also included, in order to set thediscussion of Mixed Mode programming against the architecture it is designed for.

Chapter 3 covers the benchmark written for this project, which is an adaption of a standard form of imageprocessing employing a simple Jacobi algorithm as its computational kernel. A brief examination of thealgorithm is first presented, followed by a more thorough description of the various parallel versionsdeveloped. After a description of the salient details of the HPCx service and the methodology exercised,the main body of results and analysis is presented covering all of the performance metrics used and anin-depth study of the Mixed Mode code.

Chapter 4 describes the established benchmarks used to supplement the Mixed Mode Jacobi studies ofChapter 3. These benchmarks are part of the ASCI Purple Benchmark Suite [23], and are representativeof typical scientific codes employed in HPC. Three codes from the suite were used, and these are eachdescribed separately in terms of their function and parallel implementation. A description of the higher-level benchmark methodology used to test these codes is then given, followed by the results and analysisof the three separate performance studies.

Chapter 5 presents the conclusions of the Mixed Mode programming review, and analyses the successof the project as a whole. Numerical data from all experimental runs is included in the Appendix, asalmost all data in the results and analysis sections will be presented graphically.

As Clustered SMP systems become more and more dominant in the HPC industry, interest in MixedMode programming is growing. It is hoped that this project can therefore make a useful contribution tothe growing pool of performance studies of this parallel model, in particular the effectiveness of such amodel when employed on the HPCx Service.

2

Chapter 2

Background

This chapter presents an overview of the underlying elements of this project. First, information onClustered SMP systems is presented. A short examination of the form of Mixed Mode Programmingis then conducted, along with some expected advantages and disadvantages of this style of parallelimplementation.

2.1 Clustered SMP Systems

Mixed Mode Programming, the focus of this project, is a form of parallel programming designed totake advantage of a particular supercomputer architecture: the Clustered SMP system. Such a systemis essentially a blend of what used to be the two dominant machine architectures found in the industrialHigh Performance Computing market: Distributed Memory systems; and Shared Memory systems. Theformer describes a machine consisting of individual processors, each with their own local memory,which communicate with each other via explicit message passing over some form of interconnect. Thelatter is a system where the individual processors all have access to the same global memory, allowingcommunications via direct reads/writes to this memory space. Shared Memory machines are oftendescribed as Symmetric MultiProcessors, or SMPs.

A Clustered SMP is constructed in a similar manner to a Distributed Memory machine, but it comprisesan assembly of interconnected SMP nodes rather than individual processors. All the processors on onenode have access to that node’s “global” memory, and communication between nodes is handled acrossa DM-like interconnect, as shown in Figure 2.1.

Such a system should in theory combine the advantages of both architectures. Distributed machinesare extremely scalable as new processors can simply be plugged into the interconnect; Shared mem-ory machines are handicapped in this regard as they require very sophisticated hardware to maintain aglobal memory space when hundreds of processors are involved. A Clustered SMP can therefore takeadvantage of DM-like scalability, as “processors” are simply replaced with “nodes” in this context.

Pure distributed systems have an obvious bottleneck however, as good scalability relies on a very fastinterconnect. If too many processors are trying to communicate at one time, the interconnect maybecome overwhelmed with message traffic and the efficiency of the system will suffer. Shared systems

3

P.E. P.E. P.E. P.E. P.E. P.E. P.E.P.E.

MEMORY

BUS


MEMORY

BUS


MEMORY

BUS

��

��

INTERCONNECT

Figure 2.1: Representation of a Clustered SMP System

do not have this limitation since direct reads/writes to memory require no such communication; however,this belies an even harder limitation in such machines as now processors must access the global memoryspace through a shared bus which is comparatively easier to overload than an interconnect. A ClusteredSMP trades these two bottlenecks off against one another, allowing 1000+ processor machines to bebuilt from small SMP nodes with neither the interconnect nor the buses having to scale very well. It canuse the shared nature of each of its nodes to relieve the pressure on its node-to-node interconnect, ason-node processor communication is replaced with direct memory access.

This situation is by no means ideal; it is rather the “economy” option, as it enables the construction ofa large machine with tolerable but not fantastic performance for a much smaller cost than a pure DMor SM system of comparable size. In addition, SMP units are often found as network servers in thecommercial and industrial sectors, making their development a priority for manufacturers. ClusteredSMP systems are therefore making use of technology already in existence, rather than requiring highlycustomised hardware; this makes them a more logical developmental step in the technology hierarchy.

Given these advantages, the HPC industry is favouring development in Clustered SMP Systems atpresent. Many of the most powerful machines, including the largest European academic facility –the HPCx Service [21] – and the computers built under the Accelerated Strategic Computing Initia-tive (ASCI) in the U.S. – for example ASCI White [22], or ASCI Purple [23] which is currently indevelopment– are all machines of this nature.

However, Clustered SMP systems are not the silver bullet of HPC architecture. There are some disad-vantages to such a system, owing primarily to its complexity when compared to Distributed or Sharedmachines. For example, the combination of both interconnect and memory behaviour can make per-formance analyses of a Cluster very difficult to understand, as their interaction and overlap may not beimmediately apparent. Also it can be hard to actually take advantage of the mixed architecture, as it of-ten requires both active participation on the part of the programmer and system software that can utilisethe shared memory and interconnect hardware in the most efficient manner possible. Mixed Mode Pro-gramming is a form of parallel programming that enables software writers to tackle the first of theseproblems, as long as programmers are prepared to handle the extra complexity involved.

4

2.2 Mixed Mode Programming

As there are two intrinsic architectures for parallel machines – DM and SM – so there are two mainmethods of programming for said machines. One is some form of message passing library, for examplethe MPI Standard [25], which maps to DM machines by modelling separate processes, each with theirown private memory, which communicate using explicit messages (comparable to a DM’s individualprocessor+memory nodes and their interconnect). Another is with direct distribution of work or datain parallel principally via compiler directives, using the OpenMP Standard for example [26]; this formof programming is usually paired with SM machines as the compiler distributes work across multiplethreads via the global memory.

The parallel between Mixed Mode Programming and Clustered SMP Systems now becomes apparent. Ina similar manner to the way a Clustered SMP is built like a DM machine but with large SMP nodes ratherthan individual processors, so a Mixed Mode code uses two or more parallel programming standardswith the intention of mapping separate methods to different sections of the Cluster.

One logical approach to Mixed Mode Programming is therefore to use message passing for inter-nodecommunication via MPI, and shared memory programming for intra-node work and communicationvia OpenMP. Other forms of Mixed Mode Programming exist, e.g. using data parallel programmingto either replace one standard or even to add another layer to the Mixed code, but this project will notfocus on these. Hereafter, it can be assumed that any reference to Mixed Mode Programming describesthe MPI+OpenMP model.

Even with this clarification, the definition of Mixed Mode still allows for many different approachesfor the implementation of the MPI and OpenMP, as noted in Rabenseifner [10] amongst others: forexample, the two models can operate at the same level in the code (i.e. with threads and processesworking alongside one another) and each one can be given control of sections of the program it is moresuited to. The main focus of this project is on applying MPI for course-grain parallelism (i.e. principaldata decomposition), and OpenMP for fine-grain parallelism on each MPI process (i.e. distributing thework of each process), as shown in Figure 2.2:

MPI_Init

MPI_Finalize

!$OMP END PARALLEL

!$OMP PARALLEL

Figure 2.2: Representation of a typical Mixed Mode program’s flow of control through MPI andOpenMP

This form equates directly to running one MPI process per SMP, and then having each process spawn asmany threads as there are processors per SMP; it is however possible that different machines may have

5

different ideal distributions of processes/threads. This style of Mixed Mode, with MPI communicationhappening outside parallel regions or localised to calls made from one thread (typically the masterthread) is identified as the masteronly style by Rabenseifner [10].

When writing code for a Clustered SMP system, one encounters the immediate restriction that OpenMPcannot be used across the entire machine – it is limited to within a single SMP node. Therefore the maindecision comes down to using Pure MPI when writing the code (which in effect treats the Cluster SMPsystem simply as a DM machine at the source code level; the hardware may of course be implementedin such a way as to make use of the shared on-node memory, but this would typically be hidden fromthe user), or to develop a Mixed code in order to allow the user to get directly involved in the sharedmemory side of things.

The main question is then whether or not Mixed Mode Programming yields any performance benefitover Pure MPI, and hence whether it is worth developers investing the time in writing such codes forClustered SMP systems. Given that Mixed Mode seems such a natural fit to Clustered machines, and theincreasing prevalence of such machines in the HPC industry, there has been some considerable effortinvested in such experiments already.

There are several theoretical reasons why Mixed Mode programs should be faster on Clustered SMPsystems than Pure MPI, aside from the purely intuitive comparison between architecture and code struc-ture. Two of the reasons are tied up in the hardware as mentioned earlier:

• Intra-node communication is replaced by theoretically faster direct reads/writes to memory, therebyeliminating the overhead of calling the MPI library.

• As a result of the use of the shared memory, pressure on the Interconnect is lessened as messagesnow need only be sent via it for inter-node communication in cases where the MPI library cannotitself make use of the underlying hardware. In addition, the use of OpenMP can result in lessaggregate data being sent across the interconnect, which may improve performance.

These advantages are discussed in Chow & Hysom [5] amongst others. In addition, Smith & Bull [1] andHenty [14] point to a few other reasons behind the theoretical superiority of Mixed Mode Programming:

• Essentially a corollary to the first point above, Mixed Mode code would reduce the performancehit from an MPI implementation that was unoptimised for intra-node communication.

• Performance issues relating to the algorithm may play a role, perhaps when shared-memory im-plementations are available which are more efficient than their MPI counterparts.

• In cases where the Pure MPI code uses replicated data structures across each process, a Mixedversion would permit said structures to reside only on each SMP and hence conserve memoryusage.

• Cases where the number of MPI processes is restricted (most often to powers of 2 numbers dueto the design of the code) or limited (due to the implementation of the MPI library itself beingunable to stretch beyond a certain value) would be circumvented to some extent as the number ofprocesses used in a Mixed code would be considerably less.

On top of these, developing Mixed Mode codes in the masteronly style is typically not much more ofan overhead than producing the (in many cases underlying) Pure MPI version, as reported in Bova et al

6

[9]; given this it appears that Mixed Mode Programming is indeed the way forward, at least on paper.However if a more complex OpenMP implementation is needed then this overhead would increase.

Despite all these proposed benefits to Mixed Mode Programming, works in the literature are, ironically,mixed in their reports of its performance. Cappello & Etiemble [7], Smith & Bull [1], Henty [14],and Chow & Hysom [5] all demonstrate that the Pure MPI codes outperform their Mixed counterpartsirrespective of the underlying architecture. In some cases, this difference can be quite marked [14].However, He & Ding [8] and Giraud [6] both report improved performance with Mixed codes; theformer even gets a factor of 4 increase using a Mixed code, but does go on to note that this is at variancewith much of the rest of the literature.

Rabenseifner [10] points to several problems inherent in Mixed Mode code that are not necessarily asobvious as the apparent benefits. These include:

• The inter-node communication bandwidth may be better utilised by the Pure MPI code whencompared to its Mixed masteronly counterpart. This is due in part to the Pure MPI code’s abilityto overlap its communication traffic on the interconnect, as the system may be better able to handlelots of small messages compared to a few larger ones as would be the case in a Mixed code.

• The masteronly style of Mixed Mode Programming has an inherent problem as it requires MPIcalls to occur from only one thread – this is a necessary sacrifice if the MPI implementationis not thread safe, as interactions between the two models will then be non-deterministic. Thisintroduces additional synchronisation overheads into the OpenMP, and results in the other threadson a node spinning doing nothing as the communicating thread carries out its MPI task(s). Inthe Pure MPI version, threads are replaced with processes that are actively participating in thecommunication sections of the code, which can lead to increased performance.

• The actual use of the OpenMP itself may cause additional overhead in addition to the extra syn-chronisation described above. Such a performance hit then becomes implementation dependantdepending on the machine in question.

Some of these problems can be avoided by investing more time in the development of the Mixed code,typically by using more complex decomposition at the OpenMP level. Whether or not such addedcomplexity is needed is unfortunately difficult to judge without spending the time developing “simpler”Mixed codes. However, in spite of the ambiguous nature of the theoretical benefits in terms of actualperformance figures seen, real industrial codes are being developed under a Mixed regime as seen inSalmond [4]. Benchmarks for the proposed ASCI Purple machine [23] also include Mixed functionalityin recognition of the proposed system’s Clustered SMP design.

7

Chapter 3

The Jacobi Code

This chapter outlines the main part of the Mixed Mode project: the benchmark based on the Jacobialgorithm. A brief introduction to the algorithm itself is first presented, followed by more extensivedescriptions of the various parallel implementations written. After a list of the experimental proceduresadopted for the benchmarking process, the results and analysis of the study are given in two sections:a test-case problem designed for tuning the Mixed Mode code; and then an in-depth study of a largerproblem.

3.1 Algorithm Theory

The algorithm chosen for the computational core of the primary code undergoing Mixed Mode studywas the inverse of a very simple edge-detection algorithm used for image processing; such a kernel istypical of regular domain decomposition codes with nearest-neighbour communication. Given an input2D M × N greyscale image, the “edge” pixel values (located at point (i, j)) can be built up from theindividual image pixels using the equation:

edgei,j = imagei−1,j + imagei+1,j + imagei,j−1 + imagei,j+1 − 4 × imagei,j (3.1)

Now, if the opposite approach is taken whereby one is given an edge-type input file and asked to generatethe original image from said file, it is possible to essentially run the above equation in reverse across theedge data. However, this reconstruction process is not exact, but rather requires the reverse calculationto be performed iteratively. This iterative algorithm then takes the form:

newi,j =1

4× (oldi−1,j + oldi+1,j + oldi,j−1 + oldi,j+1 − edgei,j) (3.2)

where edge is the edge input generated using Equation 3.1 above; old is the image value at the start ofthe current iteration; and new is the image value at the end of that iteration. This is one simple exampleof the standard Jacobi algorithm in 5-point stencil format, which is often used for benchmark analysesof this kind, for example in Hein & Bull [2].

8

Hence, reconstructing the image is quite a computationally intensive procedure. In addition, the algo-rithm operates over regular 2D input data sets (the edge-type input files), which lends itself naturallyto a parallelism approach using domain decomposition across multiple processors. This means thateach iteration will also require communication between processors since the algorithm has a “nearest-neighbour” form. The combination of these two elements therefore makes this algorithm a very suitablechoice for performing benchmark studies.

A second component in the reconstruction process has also been added. Whilst it is possible to simplyrun the algorithm for a fixed number of iterations in order to generate a reasonably correct image, it ismore representative of scientific-like codes if the current convergence of the new values is monitoredas well, in this case via a residual value. The iterative process can then be terminated when this residualfalls below a specified limit, indicating that the required accuracy has been achieved. The followingequation is used:

∆2 =1

MN×

i=M∑

i=1

j=N∑

j=1

(newi,j − oldi,j)2 (3.3)

and then ∆ =√

∆2 is calculated to get the residual value.

Whilst this is a more accurate example of a scientific code, it also allows access to another element ofthe process – global comms. Since each processor will have access only to its own subdomain of thewhole problem, ∆ can only then be calculated by first computing ∆2 locally, and then summing theresults over all processors. This is an important part of the benchmark, since global comms are usuallymore complex than simple point-to-point comms in terms of library implementation, and hence can havedifferent performance characteristics. In order to fully test the system, it is therefore necessary to utiliseboth forms of communication.

Finally, it should be noted that this is a somewhat overly-simplified method of performing this kindof image processing – considerably more sophisticated methods of reconstruction exist. However, themain goal of this piece of code is to perform some form of meaningful calculations in a manner thatis relatively straightforward to time, not to actually run any form of rigorous scientific analysis in andof itself. Essentially, as long as the program does not start to produce infinities or NaNs during itsrun (which are implemented in different ways on the hardware and affect the way computations areperformed, hence messing up any timings taking place), the actual mathematics behind the operationsis incidental.

3.2 Code Design

This report will only concentrate on the computationally intensive kernel design; large amounts of thecode are only used for I/O or various setup procedures necessary for a parallel implementation and henceare not relevant to this discussion except where discussed below. All code in this section was written inC.

9

3.2.1 Serial Code - The Computation

In serial form, the main iterative loop for performing the image processing is shown in Figure 3.1.

Algorithm

Delta

Update

Figure 3.1: Schematic kernel design for the Jacobi code

This loop is run for a fixed number of iterations, or when the residual falls below a specified value,whichever occurs first. There are three 2D arrays, old, new, and edge which are used throughout theiterative process. All three arrays are declared as floats, and are sized dynamically using mallocbased on the current M × N problem, where M and N are included as #define’s in the source. Allthree arrays have 1-element halos in both directions, to account for the nearest-neighbour part of theJacobi algorithm; only old actually requires them, but placing halos on all three makes coding up theproblem somewhat easier.

The three parts of the iteration loop have the following functions:

• Algorithm: This loop performs the Jacobi algorithm operation, as shown in this code fragment:

for(i=1;i<M+1;i++){for(j=1;j<N+1;j++){new[i][j]=0.25*(old[i-1][j]+old[i+1][j]+old[i][j-1]

+old[i][j+1]-edge[i][j]);}

}

This section of the code will have to read from four separate cache lines to complete the right handside of the algorithm, as old[i][j-1] and old[i][j+1] will most likely lie contiguouslyin memory; this piece of code is the most expensive in terms of computation.

10

• Delta: This loop performs the residual calculation:

for(i=1;i<M+1;i++){for(j=1;j<N+1;j++){deltasq+=(new[i][j]-old[i][j])*(new[i][j]-old[i][j]);

}}delta=sqrt(deltasq/(double)(M*N));

Both deltasq and delta are declared as double variables. The computational cost is lessthan that of the Algorithm loop in the serial code, as only three cache lines need be accessed forthe right hand side.

• Update: This loop updates the current image values in preparation for the next iteration, bycopying the data from new into old:

for(i=1;i<M+1;i++){for(j=1;j<N+1;j++){old[i][j]=new[i][j];

}}

This loop is clearly the cheapest in terms of computational cost, with only one read necessary onthe right hand side.

All three of these loops could in theory be optimised further (aside from running through the arrayindices in the order shown, ensuring memory access occurs contiguously) via additional loop fusion ormore complex forms of data tiling; however, the benchmark process can be carried out more accuratelyif each loop is kept distinct and is implemented in a relatively straightforward way, as it is easier todetermine what is taking place at the hardware level.

All three loops, and the main iteration loop itself, are individually instrumented with timer calls. Thechoice of which timer to use is dependent on the parallel implementation – see below. It is expected thatvirtually all of the time spent running the code will be expended inside the iteration loop, and as suchno other sections of the image processing code feature in the performance analysis.

3.2.2 Parallel Code - The Communication

Several different parallel versions of the code were developed, and each will be discussed below. ThePure MPI version forms the backbone for the Mixed Mode as well, and hence will be discussed first. Thedifferent implementations of the Pure OpenMP code relate to the way in which the work decompositionis handled across the different threads or with aspects of the OpenMP directive use, and are each dealtwith separately. Finally, the Mixed Mode versions incorporate parts of the OpenMP code into the PureMPI, and are covered at the end of this section.

11

Pure MPI

The two main alterations necessary in the kernel involve the addition of the communications, and al-tering the loop bounds to reflect the fact that each processor now has access only to its own part of thedecomposed problem.

The Pure MPI code was written to allow for regular 2D data decomposition, with the actual numberof processes in each dimension chosen at compile time by the user. This is typical of scientific codes,where the highest possible decomposition index matches the number of dimensions in the problem(which is typically three). Here, this 2D arrangement of processors is created using the standard MPICartesian functions, with an option to use reorder=TRUE for the communicator creation dependenton the hardware in question. All subsequent communication throughout the code takes place on thisCartesian communicator.

Instead of using the global problem size values of M and N , all of the double nested for loops in thethree stages outlined above run with limits imposed by local int problem size variables MP and NP .These are calculated using the standard method for rounding up integer division, namely:

MP = (M + Iprocs − 1)/Iprocs

where Iprocs is the number of processes in that dimension (the I simply indicates the convention thatthe i loop index counts across the M dimension). A similar equation determines NP . The three arraysused in the computational stage are then malloc-ed using these local size values plus halos. This 2Ddecomposition is quite suited to the problem in question, as the initial data itself is also very regular.There is then no danger of load-imbalance, and more complex forms of data decomposition do not haveto be considered in order to further optimise the MPI. The decomposition so created is shown in Figure3.2.

process = 2

process = 3

process = 0

process = 1

Figure 3.2: Representation of regular data decomposition and communication in 2D

Since this method will create arrays that are slightly too big (i.e. the aggregate size of all the local arrays

12

is larger than the actual global problem size) in cases where one or both dimensions does not divideequally into the number of processes in that dimension, additional checks have to be placed in the codeto ensure that the double nested for loops do not “run over the edge” of the problem space. The netresult of this is that processes lying at the high-index-value “ends” of the domain space have less workto do than others, which is preferable in terms of performance since the speed of the code is limitedby the slowest part; it is therefore better to have more processes doing more work whilst a few lie idle,rather than dumping all the extra computation onto the end.

With the communication routines added as well, the structure of the main iteration loop now appears asshown in Figure 3.3.

Update

Collective

Delta

Algorithm

Point-to-Point

Figure 3.3: Schematic kernel design for the Jacobi code with MPI routines added for communication

Turning to each of the extra components in turn:

• Point-to-Point: This part of the code performs the old arrays’ halo-swaps between neighbouringprocesses on the Cartesian grid, in advance of the nearest-neighbour reads in Algorithm. Eachprocess has a record of its four adjoining neighbours determined earlier in the code using thestandard MPI Cartesian library – processes record MPI_PROC_NULL for directions which lie offthe end the domain space.

The point-to-point communications were implemented using the MPI function MPI_Sendrecv.This routine allows the precise ordering and blocking nature of the messages to be determinedby the communication subsystem rather than by the programmer [25], and is usually the rec-ommended choice when performance is an issue. However, this is somewhat dependent on thehardware implementation available.

A process can therefore be involved in up to four communications in this section depending onits place in the domain decomposition. Two of these possible comms take place over contiguousparts of the old arrays (since this code is written in C, these correspond to sending “columns”

13

of data along sequential counts of the j index) and can hence be performed directly. In order toperform the non-contiguous (i “row”) halo-swaps, an MPI Derived Vector Datatype was createdearlier in the code for reading/writing the correct elements.

The combination of MPI_Sendrecv’s, and a Vector Datatype for non-contiguous communica-tions, should result in the Point-to-Point section of the MPI code being as efficient as possible, atleast in theory. The actual performance is however always dependent on the hardware implemen-tation.

• Collective: This section is essentially an additional component of the Delta loop. In the MPIcode, the Delta loop now sums up a double variable ldeltasq, which contains the local sumfor each processor. Before delta can be calculated, deltasq must first be determined fromthe sum of each of the ldeltasq values.

This is performed using the MPI_Allreduce function. It is necessary to compute delta onevery processor as the main iteration loop monitors the convergence of this value (hence the useof Allreduce); this could be performed by reducing the value onto a single processor and thenbroadcasting the result, but the use of MPI_Allreduce should be more efficient.

This “section” of the MPI is in fact only a single line of code in the source. However, because itis the only use of the hardware’s global communication implementation it is given greater promi-nence than might be apparent from an initial impression, and as such is directly instrumented withtimer calls in addition to the Delta loop’s.

One final difference in the MPI kernel is the addition of a compilation switch around the Delta loopand the Collective section. This was done in order to allow the collective communication section to be“turned off” if so desired; the Delta loop is included as well since without the global comm call thework it performs is meaningless. This option was included in the code for two reasons: one was simplyto allow for a cleaner test of the point-to-point communications on the hardware; the other was to testa synchronisation difference between the comms on and comms off versions, as the MPI_Allreducecall has an implicit barrier in its implementation - the only such barrier in the kernel.

Note that all six times recorded in the MPI code (one for each of the five sections detailed above,and one for the main iteration loop itself) are determined from calls to the MPI_Wtime function. Inaddition, the two timer calls made around the main loop are prefaced with calls to MPI_Barrier– this is a formal benchmark requirement to ensure that all processes begin and end the kernel at thesame synchronised point. Barriers are not enforced within the loop for the other timer calls, as thiswould introduce a considerable amount of unnecessary overhead. One further point worth mentioningis that every process makes these timer calls and calculates its own time spent on each part of the code(although the main iteration loop’s times should be virtually identical given the enforced barriers). Thiswas done in part to ensure an even workload, and also to allow for closer examination of the distributionof CPU time if so desired.

Pure OpenMP

Five different versions of the Pure OpenMP code were written, based on three different methods ofdecomposing the work over the threads, and two different approaches to designing the parallel construct.Since OpenMP requires no direct communication calls from within the source, the primary structure of

14

the kernel remains as shown in Figure 3.1, and the loops therefore still run over the global problem sizevalues of M and N . These versions were implemented as follows:

1. The simplest version of the OpenMP code just uses 3 parallel for directive calls, one perloop in the kernel, placed on the outermost (i) index in each - this provides for a 1D decompositionof the work in the M dimension. The main loop cannot itself be parallelised as each iterationdepends on data from the one before. The old, new, and edge arrays are all declared sharedbetween threads; the loop indices i and j are private to each thread; and deltasq in theDelta loop is a summed reduction variable.

2. A slightly more complex extension of the version above, the second Pure OpenMP code attemptsto minimise thread overhead by placing the entire iteration loop in one parallel region, andusing three for directives across the computation loops again over the i index. Variable scopesremain as discussed above, with the addition that the loop index variable controlling the itera-tions now must be private, and the delta value declared shared as it is used in the loop’sconvergence test on all threads.

Although this is somewhat dependent on the precise details of the OpenMP implementation, thisversion is expected to run slightly faster than the one above due to a saving in the software.Note that this is not likely to be due to threads being spawned and then killed three times foreach iteration in the first version (as might be assumed from the description), as most OpenMPimplementations will spawn their threads at the start of the execution run, and then simply spinthe spare threads until they are needed in the code. Instead the saving will most likely be due tothe reduced software bookkeeping needed with only one parallel region.

Two other features emerge from this code. One is that all timer calls must now be made on onlyone thread; this is pretty straightforward, as each call is performed under a master directive.This is the most efficient synchronising construct to use in this situation since it only requires aninternal check of the thread ID, and does not force an OpenMP Barrier at the end of the mastersection; other constructs require user-invisible variables to be mutex-locked or similar, whichslows the code down unfairly. Another use of master is made to protect the update of deltaafter the Delta loop has completed.

The second feature is more subtle. The first version contains three implicit synchronisation bar-riers, one per parallel for directive. With this code, there is a potential option to removethe barrier on the Algorithm loop using a nowait clause (which cannot be used on parallelfor directives, but can on simple for’s). Doing this relies on each thread retaining “control” ofits own section of the new and old arrays between the Algorithm loop and the Delta loop, sincethe operation of the Delta loop only requires that the values of new local to each thread havebeen written from Algorithm. Such behaviour is only guaranteed when using the decompositionschedule static, as the iteration to thread mapping will always be known. However, this codehas been written to use the default schedule (i.e. no schedule specified), and as such the requiredbehaviour cannot be assured (this concern could obviously be retro-actively addressed by simplyspecifying the schedule where necessary).

Note that even if the first barrier is removed in this way, the other two loops cannot run undernowait. The reduction of deltasqwill force synchronisation at the end of the Delta loop, andthe results of the Update loop must be synchronised before the next iteration of Algorithm sinceit requires nearest-neighbour reads that are certain to require access to other threads’ data.

15

3. The third version is again a small variation on the three parallel for’s theme. In this code,the three directives are replaced with plain parallel’s instead. Work decomposition withineach parallel region is then carried out by placing for nowait directives over the inner loopsin the double-loop nests. This forces work decomposition over the j index in contrast to the i de-composition employed in the first and second versions; this was done in order to investigate which“way up” OpenMP prefers its 1D decompositions to lie over. The nowait clauses are necessaryin order to prevent synchronisation at the end of each iteration of i, which would otherwise incura vast amount of unnecessary overhead.

Other variable scopes remain the same – reduction clauses can be specified on paralleldirectives as well as for’s, so this does not throw up any problems at this stage. Note that asingle parallel region version of this decomposition was not created; the reasons for this willbe covered in Section 3.4.

4. The fourth OpenMP code is more complex than the previous versions, as it is designed to allowfor a 2D decomposition of the work across the domain. In order to do this, a separate functionwas written which allowed the user to specify the number of threads required in each dimension;the function then determined what loop limits each thread should have in order to match this ar-rangement based on the problem size. This procedure was very similar to the data decompositionemployed in the Pure MPI code, including the standard of giving domain-edge threads less workto perform.

These thread-specific limits were then stored in four arrays (one for each of the upper and lowerbounds of both the i and j indices), and were indexed using the ID number returned from theomp_get_thread_num() function in order to allow private access for each thread. Thesefour arrays were therefore declared shared, as only the ID-index must remain private.

Within the kernel itself, explicit limits on the i and j loops were then replaced with referencesto these arrays. This version was again implemented with three separate parallel regions (nolonger parallel for’s, as the decomposition has now been performed manually), and othervariable scopes remain the same. Again, the reduction of the deltasq variable can be specifiedon the enclosing parallel directive.

As an aside, it should be noted that this form of decomposition could be performed with nestedparallelism; however, the system available for the project did not support this at the time.

5. The final OpenMP version was again a slightly optimised version using only one parallelregion – only this time the starting point was the 2D decomposition code. The variable scopesremain as discussed for versions 2 and 4 above where appropriate, with one important exception.

Lacking a suitable directive to bind to, the reduction of deltasq must now be performed byhand. The scalar variable is replaced by an array indexed by the thread ID number in an identicalmanner to the loop limit arrays. Each element of the deltasq array then contains the portionof the sum calculated for that thread once the Delta loop is complete, and the final sum is thenperformed on the master thread after an explicit barrier call.

A second explicit barrier must also be added after the Update loop has run, in order to ensureall data is ready for use before the next Algorithm loop. Note that this version of the code can makesafe use of the lack of a barrier between Algorithm and Delta, since the manual decompositionensures that each thread retains control of the same sections of the new array.

16

In addition to the slight saving from having only one parallel region, this version may showa slight speedup to version 4 based on fewer calls made to omp_get_thread_num(); version4 must make three such calls every iteration, whereas 5 need only perform one before the mainiteration loop commences.

In summary, the various different OpenMP versions allow for the following work decompositions to betested, as shown in Figure 3.4.

i

j

Figure 3.4: The three decompositions available with the OpenMP codes. The left hand panel describesi-decomposition with versions 1 and 2. The centre panel shows j-decomposition with version 3. Theright hand panel shows one possible 2D decomposition using versions 4 and 5, although other methodsincluding 1D decompositions would be perfectly possible.

Two other features of the OpenMP design are common to all versions. One is that, lacking access tothe MPI library, all timer calls in the Pure OpenMP codes are made to gettimeofday. There is aslight issue regarding the relative accuracies of MPI_Wtime and gettimeofday, but the results ofthis chapter will not quote times to enough significant figures for this to be a problem.

The second feature is the addition of a compiler switch around the Delta loop in all versions of thecode. This allows for comparable runs between “no collective comms” MPI and “no collective threadoperations” OpenMP to be made. Note that synchronicity differences are not an issue here as they werewith Pure MPI, as all the different OpenMP versions contain explicit or internal barriers somewhere inthe kernel.

Mixed Mode

Two different versions of the Mixed Mode code were developed. Both build on the Pure MPI version,so the code structure of the Mixed Mode kernel is as shown in Figure 3.3, along with the Mixed loopsrunning over local MPI process sizes MP and NP and checking for domain-edge processes. Also, theMixed code uses MPI_Wtime for all its timer calls.

The two versions were built using different versions of the Pure OpenMP code as templates; however,both conform to the masteronly style of Mixed Mode programming. As was discussed earlier, the twoparallel styles are being used at different levels in both codes, with the MPI performing the primary(coarse-grained) data decomposition, and the OpenMP is then spawned on each process to carry out thesecondary (fine-grained) work decomposition. This interaction between the MPI and the OpenMP canbe modelled as shown in Figure 3.5.

17

process 0

MPI

process 1 process 2 process 3

OpenMP OpenMP OpenMP OpenMP

thread 0

thread 1

thread 2

thread 0

thread 1

thread 2

thread 0

thread 1

thread 2

thread 1

thread 2

2D Array

thread 0

Figure 3.5: Schematic representation of the MPI/OpenMP hierarchy in the Mixed Mode design

Under the masteronly style, all MPI function calls must be made on only one thread. This is often amajor issue in Mixed Mode programming, as the available implementation of the MPI library itselfmay not be thread safe; this necessitates isolating MPI calls in OpenMP synchronisation constructsfor their protection. Also, overlapping thread use with MPI calls is often a considerable programmingheadache, as all threads in the programme must possess different thread ID’s and be aware of theirthread neighbours both within their own process and on neighbouring processes.

Keeping the Mixed code in the masteronly style is not only simpler to implement, it is also more repre-sentative of the form Mixed Mode scientific codes are likely to take – it is considerably more likely for aprogrammer to sprinkle a few OpenMP directives into a functioning MPI code than to gut said code andredesign the entire communication hierarchy to allow for inter-thread comms. Both codes take differentapproaches to this design style, as discussed below:

1. A Mixed Mode code based on Pure OpenMP version 4. This was chosen for two reasons: the2D decomposition of the OpenMP threads in addition to the 2D data decomposition of the MPIallows for the greatest amount of flexibility when it comes to determining the most efficientthread/process combination for the Mixed code; and the fact that version 4 uses three separateparallel regions means that no problems arise with calling MPI from multiple threads – exe-cutions performed outside the parallel regions are by default run on the master thread alone.

Note that this Mixed Mode code has three distinct OpenMP barriers, one per parallel region,and that the Delta loop’s reduction now takes place on ldeltasq – this has the effect of perform-ing the global sum on each SMP node first via the OpenMP, and then making the final inter-node

18

sum with MPI. This does again appear to be a more logical use of the parallel programming forms,given the underlying system architecture.

2. A Mixed Mode code based on Pure OpenMP version 2. This code was developed after testshad been performed on the Mixed version described above; this may at first seem somewhatcounter-intuitive as this version only allows for a 1D thread decomposition. The reasons behindthe development of this Mixed version will become clear in Section 3.4.

Since this now only has one parallel region enclosing the entire kernel, more care needs tobe taken with the OpenMP/MPI interaction. All MPI function calls (the entire Point-to-Point andCollective sections, and all timer calls) are now made under master directives – a clear demon-stration of why this is called the masteronly style of Mixed Mode programming. As discussed inthe Pure OpenMP code design, master is the most efficient synchronising construct to use inthis case.

This requires the addition of an explicit OpenMP barrier into the code between the Point-to-Point and Algorithm sections, in order to ensure that all halos have been updated before thethreads begin their Algorithm work. The reduction in the Delta loop forces a second barrier, andthe Update loop requires a third as the old array must be updated before the halo-swaps aremade. However, the Algorithm loop still runs with a nowait clause despite the inherent dangerof a non-conforming thread implementation.

As noted, both versions of the Mixed code have a considerably higher degree of synchronisation thanthe Pure MPI, as even with the Delta and Collective sections turned off, the Mixed code exhibits twoOpenMP barriers in either version. This is a necessary trade-off under this style of Mixed Mode; tocircumvent it, one must redesign the comms so that threads can “pretend” to be processes during com-munication as discussed above.

Both versions were built into the underlying Pure MPI code in such a way that all OpenMP-relatedadditions could be switched off at compile time; in effect, a single source acts as either Mixed or PureMPI depending on the settings in the Makefile. This was done in order to ensure the reliability ofbenchmark comparisons between the two parallel styles – with only one source, differences in runtimemust then be due entirely to the way the OpenMP and/or MPI interact with the hardware, which is thecentral point of study of this project.

Unfortunately, given the large amount of MPI calls needed in the setup stage, it was not possible toeasily maintain a Pure OpenMP option from the same source as well. As such, it should be noted thatall data presented for Pure OpenMP runs is not gathered from an otherwise exactly identical code to theMixed/MPI; any factors that arise from this difference will be discussed in Section 3.4.

19

3.3 Methodology

Here, the three core components of the experimental process are discussed separately. First, the technicaldetails of the machine used are covered, followed by a description of the software employed. Finally,the experimental standards used throughout this chapter are presented.

3.3.1 Hardware: The HPCx Service

All data recorded in this project (including the subsequent chapter on the ASCI Purple Benchmarks)was obtained from runs on the National HPCx Service [21]. This system consists of a cluster of 40 IBMp690 SMP frames, each containing 32 processors for a total of 1280 processors across the machine, anddelivering up to 3.2 Tflop/s sustained performance.

Per frame, the 32 processors are subdivided into 4 Multi-Chip Modules (MCM), each with 8 processors.Each MCM then contains 4 chips, with 2 processors per chip. The processors used are all IBM Power4s,clocked at 1.3 GHz. The cache hierarchy of the MCM units reflects this progressive division: eachprocessor has its own Level 1 cache (separate instruction and data caches); each chip then has its ownLevel 2 cache, shared between the 2 processors; and finally each MCM has its own Level 3 cache, sharedbetween the 8 processors. The actual cache sizes are given in the table below:

Level Organisation CapacityL1 Data Two-way, 128-byte line 32 KB per processor

L2 Eight-way, 128-byte line 1440 KB per chip; 720 KB per processorL3 Eight-way, 512-byte line 128 MB per MCM; 16 MB per processor

Table 3.1: HPCx cache design and hierarchy

Each frame then has 32 GB of Main Memory, shared between its 4 MCMs. Each MCM is connectedto the Main Memory, and to the other MCMs, via a 4-way bus interconnect, making for a 32-way SMP.Communication between frames is handled via IBM’s SP Colony switch.

The most important feature of HPCx’s SMP node hardware is that the system is presently configured tooperate every MCM as a distinct Logical PARtition (LPAR), each with its own copy of the operatingsystem. The Main Memory on each frame has also been subdivided to match this partitioning, withevery LPAR then having 8 GB of dedicated memory. This has been done to increase the communicationsbandwidth across the Colony Interconnect. This means that the overall nature of the system appears tothe user as a Cluster of 160 8-way SMP nodes.

3.3.2 Software

Each LPAR of HPCx runs its own copy of the IBM Unix operating system, AIX 5.1D. Actual runs onthe system are handled through a batch queue controlled with IBM LoadLeveler. Critically, given theshared nature of the caches and memory between processors on an LPAR, LoadLeveler ensures thatonly one user application can be running on an LPAR at any given time.

Since the Jacobi code has been written in C, compilation was performed using version 6.0.0.2 of IBM’s

20

xlc compiler. Specifically, the Pure OpenMP codes were compiled using xlc_r, and the Mixed Modeand MPI codes used mpcc_r. The r suffix indicates that thread-safe code should be generated, andis vital for use with OpenMP. The mp prefix indicates that the code contains MPI, and includes anynecessary libraries automatically. For C programs, note also that int and float variables are storedas 4 bytes, and double’s as 8 bytes.

The version of OpenMP for C available on HPCx was 1.0. Some functionality of MPI 2.0 has been im-plemented, but since the Jacobi code makes no use of any MPI-2 features, the code essentially ran underMPI 1.2 which is the fully supported standard on HPCx. Also, note that the Cartesian Communicatorbuilt in the MPI code was constructed using reorder=FALSE. HPCx has a very inefficient TRUEalgorithm implemented in its MPI library, and as such it is better for the communications if a FALSEgrid is used. This also means that it is easy to identify the physical location of the MPI processes, asLoadLeveler institutes a “block” approach to process allocation (processes 0 to 7 go on the first LPAR,8 to 15 on the second etc.).A TRUE grid completely disrupts this ordering, making detailed performanceanalysis considerably more difficult.

Two particular features of the MPI implementation on HPCx are worthy of note here. The first is themessage Eager Limit, set via an environment variable. The value of the Eager Limit sets a messagesize above which messages are sent between processors in a slower fashion (they require a rendezvousprotocol) compared to those of a size below this value (which are sent immediately). For all runs madethroughout this project, the Eager Limit was explicitly set to its highest value to ensure the fastestpossible communications were taking place wherever possible.

The second point about the MPI implementation is with regards to point-to-point communications withinan LPAR. By default on HPCx, an environment variable has been set to allow on-node processors tocommunicate with each other through the node’s shared memory, rather than passing messages throughthe interconnect. This is clearly an advantage in terms of performance, and has been used at all timesthroughout this project. Note that the situation with global communications is rather more complex, withthe implementation most likely instituting some form of tree algorithm to assemble a communicationpicture – it is unclear whether this benefits from access to a node’s shared memory.

In addition to the standard C math library (needed for the sqrt function in Delta amongst other things),the following compilation options were used:

• -q64 This enables 64-bit addressing, and also allows for better memory management throughoutthe program.

• -qarch=pwr4 -qtune=pwr4 These specify instruction set architecture and bias optimisa-tion for a Power4 system.

• -O3 This indicates that Third Level Optimisation should be performed on the code. This is amoderate level of code optimisation, performing some software pipelining and source manipula-tion amongst other things. A higher level of optimisation was experimented with but not generallyenforced – such use of the increased optimisation is indicated in Section 3.4.

Note that the optimisations performed with -O are essentially serial in nature. Given the nature ofthe Jacobi code’s kernel, it is therefore expected that most performance related features will fallout of the parallel implementation chosen, and not the compiler optimisation used.

In addition, the Mixed and Pure OpenMP codes require:

21

• -qsmp=omp:noautoThis includes the use of OpenMP in the code. The noauto instructs thecompiler to only use OpenMP threads where indicated in the source code via explicit directives;without this option enabled, the compiler may add in extra loop-based OpenMP if it feels it wouldbe beneficial. In order to switch the Mixed/MPI source from one parallel style to the other, allthat need be done is comment out this flag in the Makefile.

3.3.3 Experimental Procedure

In order to obtain accurate timings, every run was performed five times and the results averaged to givetimes and standard deviations for each section. Note that the three parallel versions of the Jacobi codeall have different amounts of timer data that get taken into consideration during this averaging process:the MPI processes each output their own recorded times for each section in the kernel in addition to theoverall kernel-time; no such action is performed with threaded code however, as only the master threadmakes calls to the timer function.

To give a concrete example, for a run of the Jacobi code performed on 1 LPAR the Pure OpenMP codewill output one time for each of the six required values. The Pure MPI code will output eight times foreach required value, one per processor. The Mixed code will output as many times for each requiredvalue as there were MPI processes assigned to it. So it is important to remember that despite all of theparallel runs being performed five times, the more MPI processes there were present in the run the largerthe sample space of times behind each average value actually is.

Various different M × N problem sizes were used for gathering the data; the specifics of each aredescribed in the upcoming Section (3.4). Irrespective of the problem size however, the code was alwaysrun for a fixed number of iterations by choosing an unattainably low required value for the convergencetest – this ensures that the workload is always the same. The primary reason behind the inclusion of theDelta loop is therefore as a test of the collective communication implementation.

22

3.4 Results and Analysis

This section presents all of the results obtained from the benchmarking studies of the Jacobi code.First, a test problem case used mainly for code development is described. Second, a more extensivelybenchmarked problem is presented, including in-depth studies of particular features of the code. Finally,a brief summary concludes the section.

3.4.1 Fixed Problem Size

The first problem chosen for experimentation was fixed to have M×N equal to 1000×1000 irrespectiveof the number of processors allocated. This gives a total global problem size of approximately 12 MB (310002 float arrays plus halos). The code was set to run for 10000 iterations by fixing the convergencetolerance to a value of 0.01; after 10000 iterations this problem has only converged to a residual of0.212374 so this ensures the same amount of work is being performed for all runs. This residual valuewas also used as a sanity check of the code to confirm that everything was working properly, since itshould always reach the same answer no matter the parallel implementation chosen.

This set of runs were used principally to test all of the parallel codes described in Section 3.2 anddetermine which version, or combination of versions, gave the best performance. Once this had beendetermined, the code choices for the Pure OpenMP and Mixed codes were fixed before proceeding tothe more extensively tested problem sizes detailed later in this section.

The results are presented in the order in which they were collated; this may appear at first to be asomewhat counter-intuitive use of the available parallel codes, but this best represents the stages ofdevelopment that they progressed through.

Pure Codes

The first runs performed were on 1 LPAR using the Pure MPI code and Version 4 (2D; 3 parallel regionsper iteration) of the Pure OpenMP code. This data is of critical interest, because in order for a Mixedcode to stand a chance of outperforming a comparable Pure MPI code, the OpenMP must demonstratesuperiority at the intra-node level.

The results are presented as histograms, with separate stacked components for each of the main sectionsin the kernel (recall that the OpenMP and Mixed/MPI kernels are different, as the latter includes twospecific communication sections). This presentation format will be adhered to for all performance datapresented in this section. These graphs are shown in Figure 3.6.

Tabulated data is presented in the Appendix (A.1.1) along with the standard deviations for each section;note however that the raw data gives a value for the total runtime rather than an “other” section as shownin the graphs. Here, “other” refers to time spent in the kernel outwith any of the principal sections, andincludes overhead relating to timer use, amongst other things. Standard deviations were not plotted onthe histograms purely in the interests of clarity, as they would often appear to overlap and hence obscuretheir meaning.

Runs were performed on 7 and 8 processors in order to confirm whether the operating system of an

23

Figure 3.6: Timer data for Pure OpenMP and MPI runs performed on 1 LPAR. Horizontal axis displaysthe given process or thread layout in 2D

24

LPAR was content to share a processor with a Jacobi thread/process, or if better performance couldbe obtained by leaving one free (other work in this area has suggested that the latter can be true insome circumstances – see Hein & Bull [2]). All possible thread or process configurations were tested,in order to determine which one gave the best performance; the numbers shown on the x-axis of eachgraph correspond to the M × N directions respectively; note that in the tabulated data the labels I andJ are used to denote process/thread allocation in a given direction, where I refers to the “axis” parallelwith the M direction and similarly for J and N .

Specific features of the OpenMP performance worthy of note are:

• As the chosen thread geometry moves further away from a 1D decomposition over the M direction(outer loop), the performance gets dramatically worse in all code sections. This is expected,since in C data is stored contiguously over the innermost array index (corresponding to the Ndirection here). As the decomposition moves from “columns” to “rows” (see Figure 3.4) cacheinvalidations will increase as cache-lines will now break across the rows rather than following theshape of the columns. This leads to the phenomena of False Sharing, whereby processors workingwith data at the edges of their domains will be sending invalidates to their neighbours, and viceversa, much more frequently with a row-like decomposition like 1x8 where the processors accesspotentially hundreds of lines that lie on two threads. In a column-like decomposition like 8x1,such invalidations will only only happen twice at most on a processor (where cache-lines breakonly at opposite corners of the decomposition).

• 8x1 is faster than 7x1, although the relative speedup is quite poor (i.e. the 8-thread runtime isnot 7/8 times the 7-thread runtime, which would be expected for a perfect scaling code). For thepurposes of the continued benchmarking of the Jacobi code, the former is the more importantresult as it provides motivation for always actively using all 8 processors in a node when runningthe Mixed codes, since the focus of interest here is in obtaining the best performance on themachine. From this data, it is not really possible to say whether the poor scalability is comingfrom the operating system demons sharing a processor with a code thread, or whether the codeitself simply does not scale well.

• Comparing the separate sections of the Pure codes, we see that the Algorithm and Delta loopsboth run faster in the OpenMP code, whilst the Update loop prefers the MPI implementation.The Delta loop is a clear indication of the advantages of using the shared memory to circumventdirect communication calls since the OpenMP version is performing the operations of the MPI’sCollective section as well. This means that the single Delta section in the best OpenMP code is20% faster than best comparable MPI sections. By a similar token, the Point-to-Point section iscontained in both the Algorithm and Update OpenMP sections (from off-cache reads and write-invalidates respectively), so overall these run 6% faster with OpenMP. This is all good news froma Mixed code standpoint.

Turning now to the MPI data:

• Again, the 7-process runs are slower than the 8, but the relative scalability appears quite poor.This means that all further MPI runs will be made using the full 8 processors per node as above.

• The Collective section of the code appears completely insensitive to the chosen process decom-position, which is not surprising given that no inter-node comms are taking place and hence thecollective operations are not really being tested yet. The computation sections all show some

25

favour towards having more processes aligned in the M direction as opposed to the N , but thisdifference is nowhere near as pronounced as for the OpenMP (indeed, 4×2 seems to be preferredto 8 × 1). This possibly comes from compiler optimisation effects, as -O3 may have an easiertime performing loop optimisations like unrolling on double loops which have a longer internal(N ) loop than external (M ) loop.

• The stand-out change in performance from process decomposition comes from the Point-to-Pointsection, with the 8 × 1 running 125% faster than the 1 × 8. Contiguous memory is again theculprit of this difference: a 1D topology that divides across the I direction only has to sendcontiguous “columns” of data around and copy them into the halos of neighbouring processors.The inverse decomposition, attempting to send “rows”, must first copy the correct elements into a1D buffer and then unpack them again after the communication has completed. Despite the use ofan MPI Derived Datatype for the latter form of send/receive, this process is still much slower. Thisprovides another reason for the MPI to favour the same decomposition strategy as the OpenMPcode.

In summary, these results appear to be quite promising. Whilst the OpenMP has the potential to givequite poor results for particular thread geometries, the best thread geometry (8x1, corresponding to thedecomposition that would be obtained using worksharing for directives over the outer M loops in thekernel) outperforms the best Pure MPI result by around 10%. This bodes well for the upcoming Mixedcode, as these results imply that its OpenMP sections should run faster than the MPI.

Mixed vs. MPI

Here, Mixed code version 1 (2D MPI; 2D OpenMP; 3 parallel regions per iteration) was used as thecomparison against the Pure MPI code; based on the data gathered from the OpenMP studies above, theOpenMP decomposition was fixed at 8 × 1 for all runs. Note that at this stage in the project, version 2of the Mixed code had not been written yet – see later.

Runs were performed on 4 and 8 LPARs for various process decompositions. For the Mixed runs, allpossible process-decompositions were examined; for the MPI however, the 2×/×2 were ignored as itwas expected that they would show very similar behaviour to the comparable 1D layout. The histogramdata is presented in Figures 3.7 and 3.8, with tabulated data given in the Appendix (A.1.1) as before.

These results clearly show that the Pure MPI code is on the whole a better choice than the Mixed. For4 LPAR runs, the MPI code outperforms the Mixed by around 18% for the best decompositions. With8 LPARs the situation is not as clear-cut; the MPI times do appear to be slightly faster, but the Mixedand MPI results overlap within one standard deviation of each other and hence cannot be meaningfullydistinguished. However, comparing the runtimes between the 32 and 64 processor data sets, it appearsthat the problem size is now too small on 64 (equivalent to only around 190 KB per processor) to benefitfrom runs this large.

In almost all cases, all sections bar the Collective run faster under the Pure MPI code. Considering thesein turn:

• The Point-to-Point section in both codes favours 1D decompositions over 2D, and favours thelong-N decomposition over the M . The latter is again explained by the communications beingfaster without the need to construct contiguous blocks of data for sending as described in the above

26

Figure 3.7: Timer data for Mixed (top) and MPI (bottom) runs performed on 4 LPARs. Horizontal axisdisplays the given MPI process layout in 2D

27

Figure 3.8: Timer data for Mixed (top) and MPI (bottom) runs performed on 8 LPARs. Horizontal axisdisplays the given MPI process layout in 2D

28

section. The former is a little more complicated. For the Pure MPI codes, the 2D arrangements areslower because of the underlying process-to-LPAR structure. For example, 32× 1 decompositionappears as repeated blocks of the following structure, as shown in Figure 3.9.

Figure 3.9: Representation of the process-topology for a 1D problem. Dashed blocks indicate LPARboundaries; solid squares are individual processes.

For this decomposition, only one inter-node communication is necessary between LPARs; all theremaining point-to-point comms take place through the shared memory in the MPI library (recallthat a block allocation of processes to LPARs is performed by the system in this case, and thatintra-node comms use shared memory by default). Since inter-node comms must travel across theinterconnect, minimising this traffic should improve performance. The situation for the 1 × 32decomposition is similar, although now all communications are non-contiguous.

By way of contrast, the 8 × 4 decomposition would appear as blocks of a different nature, asshown in figure 3.10. Here it is now necessary for 4 inter-node communications to take place perLPAR, which should therefore slow the point-to-point comm section down.

Figure 3.10: Representation of the process-topology for a 2D problem. Dashed blocks indicate LPARboundaries; solid squares are individual processes.

The situation regarding the Mixed code’s Point-to-Point section is not as clear. It is not yet pos-sible to make any clear discussion as to why the Mixed code runs slower with a 2D MPI processlayout, since all communication traffic happens over inter-node boundaries in the “1 process perLPAR” model; given this, one would therefore expect the fully non-contiguous send/receive lay-out to give the worst performance as this should be the only contributing factor.

In terms of a comparison between the Mixed and Pure MPI codes, it can be argued that the Mixedmessages should take longer to travel across the interconnect since they will be larger than thePure MPI’s; however, there will be fewer of them. Also, on 4 and 8 LPARs one would expect allmessage sizes (be they Mixed or MPI) to lie below the Eager Limit, which suggests that the fewerthe messages the faster the communication time. This is once again contrary to the performancefigures obtained, and cannot be satisfactorily explained at this stage.

• In stark contrast to the complexity of the Point-to-Point analysis, the Collective section is muchmore straightforward to explain. Since there are more processors involved in global communi-

29

cations in the Pure MPI model, one would expect its Collective section to take longer. This isclearly borne out in the performance data.

• All three computation loops run slower in the Mixed code. This might have been expected due tothe inter-thread communication in these sections, but given the results of the Pure OpenMP codethis does not really add up. Comparing like for like, the best Mixed loop (Delta) is around 25%slower than the MPI counterpart, with the worst (Update) being closer to 75% slower. These dif-ferences are too large to be due solely to the inclusion of additional memory accesses, suggestingthat perhaps the overhead generated from the parallel regions themselves may be to blame.

Across the board, the computation loops do not show any particularly obvious trends in terms ofchoice of process topology (with one exception – see below). There is some evidence that longerJ loops are in general the best choice (with the 1D version then giving the best Point-to-Pointcommunication time), but this trend is not completely consistent. However, when taking the 8LPAR loops into account it must be remembered that the computation scalability may well havereached saturation-point for this problem size.

One striking effect is witnessed in the 16×4 and 4×16 Algorithm times for the Pure MPI code on8 LPARs. Using either of these process decompositions causes the Algorithm time to increase byalmost a factor of three, and this effect is reproducible given the standard deviations present in thedata. These decompositions were investigated with the hpmcount utility, part of the HardwarePerformance Monitoring Toolkit [27], under the assumption that some form of cache-thrashingmust be to blame for this anomaly. The L2 and L3 caches showed no unusual behaviour comparedto a run with the 8 × 8 decomposition, which is not really surprising given that the problem sizeis now around 190 KB per processor and hence would easily fit into the 720 KB of availableL2. This suggests that something, probably conflict misses, is going on in L1 that is causing thisslowdown; however this anomaly is not a very critical part of this analysis and hence was notinvestigated further.

These results do appear somewhat disappointing from the Mixed code’s perspective, as it is outper-formed in almost every respect by the Pure MPI. However, whilst the reason behind the Point-to-Pointcomms being faster in MPI is as yet unclear, the three computation loops could well be suffering due tooverhead generated in the three parallel regions. If this were indeed true and could be eliminated,the Mixed code would show a marked improvement in performance and may start to win out over theMPI.

To test this supposition further, a Mixed run on 4 LPARs was attempted using only 1 thread per processand allocated as many processes as processors – in effect a Pure MPI run with the addition of theparallel overhead. A run on 8 LPARs was not attempted, since the problem appears to have ceasedto scale by that stage anyway. The same process decompositions were chosen as for the Pure MPI run,and the results are presented in Figure 3.11 and in the Appendix (A.1.1).

Comparing this graph to Figure 3.7, one can see that the three computation sections continue to runslower with the Mixed version even when the same number of MPI processes in an identical topologyare assigned. In addition, the three computation sections show some improvement between the 8-threadand 1-thread Mixed runs, demonstrating that replacing threads with processes improves performance.Since the code is essentially doing the same work (discounting the memory-instead-of-comms situationdiscussed at length above), this does strongly suggest that the overhead from the three parallelregions is at least partly to blame for the poorer Mixed performance. This result is then the motivation

30

Figure 3.11: Timer data for Mixed runs performed on 4 LPARs with 1 thread per process. Horizontalaxis displays the given process layout in 2D

for the next section on Pure OpenMP studies.

One final point to be noted out of completeness is that the “other” section has become larger in all the4 and 8 LPAR runs for the Mixed and MPI codes, compared to the 1 LPAR runs detailed earlier. Thisis most likely due to the increased amount of timer calls and resultant calculations (for recording thetotal time over all iterations, each section’s time must be summed separately on each process) going onbehind the scenes, and is not a cause for concern.

Pure OpenMP Studies

The purpose of this section was to determine which of the various Pure OpenMP versions gave the bestperformance, with the intention that the winner would then form the basis for an improved Mixed versionthat would then hopefully be able to compete against the Pure MPI. Of course, given that the secondMixed code version has already been detailed in Section 3.2, the answer has already been revealed;however, the performance data that led up to this decision is still important.

Before presenting the data, some special consideration will first be given to Version 3 of the PureOpenMP codes (1D; decomposition across the J index). No data was rigorously recorded for thisversion because it quickly became apparent that it gave incredibly poor performance.

The reasons behind the performance failure of version 3 are most likely two-fold. One is the same row-

31

decomposition cache-line invalidation problem detailed in the Pure Code analysis above. The secondreason appears to be tied up with the way the OpenMP directives are turned into functioning code bythe compiler. During the development stage of code design (i.e. before this run with the Fixed ProblemSize), attempts were made to compile the code with -O4 since this was the highest level of optimisationthat would have any effect on the Jacobi code; -O5 deals with inlining functions in a super-optimisedfashion, and since the Jacobi code was designed with all operations of the kernel taking place in themain function this would therefore have had no effect. However, version 3 of the OpenMP code provedto be unstable under -O4 and a test Mixed version built from it broke completely. This suggests that theextracted OpenMP-generated functions for this version were implemented in a less than optimal way, asall other Mixed/OpenMP versions ran correctly with -O4. This could then be partially responsible forthe very poor performance of this code.

With version 3 so discarded (and the reason as to why a single parallel region version was nevergenerated now apparent), our attention turns to the other flavours of OpenMP implementation. Theseresults are graphed in Figures 3.12 and 3.13, and tabulated as normal (see Appendix A.1.1) – note thatthe results from version 4 have been reproduced in Figure 3.13 for ease of reference.

To summarise, the different versions have the following features:

Version Code Design1 3 parallel for directives2 1 parallel region; 3 for directives4 3 parallel regions; 2D decomposition by hand5 1 parallel region; 2D decomposition by hand

Table 3.2: Tested OpenMP code versions and their features

These results indicate that version 2 is the clear winner, with the 8-thread run outperforming version4’s 8 × 1 decomposition by around 8%. General trends in fact show that version 4 is the slowest code;the performance improves with both the change to one parallel region, and the change to threeparallel for directives, hence combining these two changes provides the best use of OpenMP forthe Jacobi code.

7-thread runs were again performed for all these versions, but once again performance figures areslightly better for 8 threads even if the scalability is poor. Note that the results for version 5 are in-conclusive as to whether 7 × 1 is better than 8 × 1 because the total times for each overlap within onestandard deviation of each other.

It should be noted that an 8-thread use of version 2 gives the same work decomposition as an 8 × 1use of version 4. Hence the gains seen in performance must be operating at a more subtle level thansimply the apportioning of work. One possible answer is the reduction in thread overhead by beginningand ending the parallel region outside the main iteration loop. As discussed earlier, the OpenMPimplementation is very unlikely to be spawning and killing threads with each region in version 4 (and1), but reassigning instructions to spinning threads or waking up sleeping ones may have been slowingthings down.

A second possibility as to what could be going on behind the scenes is the generation of OpenMP-onlyfunctions at compile time. Since we have already seen that outlined OpenMP functions interact withcompiler optimisations (as version 3 breaks with -O4), it is therefore not unreasonable to assume that

32

Figure 3.12: Timer data for Pure OpenMP version 1 (top) and 2 (bottom) runs. Horizontal axis displaysthe number of threads used

33

Figure 3.13: Timer data for Pure OpenMP version 4 (top) and 5 (bottom) runs. Horizontal axis displaysthe given thread layout in 2D

34

these interactions could be positive as well. Hence the speculation here is that the functions outlinedwith the for directives in version 2 are more suited to the optimisations performed under -O3 thanthose from any other.

Note that it is unfair to compare version 1 and 2 say for such behaviour in Algorithm, as version 1 hasa forced barrier at the end of its section. Whilst it is therefore possible that the performance gain fromusing version 2 comes entirely from the use of nowait on the Algorithm section, the fact that version5 does not display a similar improvement (it too has no barrier at the end of its Algorithm loop) doessuggest that something else is in play as well.

Another point of interest is the increased time spent in the “other” section for the code versions thatonly have one parallel region. This likely comes from the small sections of the kernel (mainly timercalls) that take place inside master directives; whilst such directives are the most efficient method ofrestricting operations to one thread, it will still slow things down a little as the master thread catches upwith the others.

With the performance victor determined, a second version of the Mixed code was written with its kerneldesign based on version 2 of the OpenMP code as described in Section 3.2. Whilst dropping the 2D de-composition may appear to be a backward step in terms of flexibility, all the available data demonstratesthat the Jacobi code gains no benefit whatsoever from that option.

Improved Mixed vs. MPI

We now come to the final use of the fixed problem test case: confirmation that the new Mixed versiondemonstrates an improvement in performance compared to the old. The Mixed code was rerun on4 LPARs only (since it appeared that 8 had reached scalability saturation), and compared against theMixed run from before. Both plots, including a reproduction of the Mixed results for ease of reference,are given in Figure 3.14 with tabulated data in the Appendix (A.1.1).

These results demonstrate that the new version of the Mixed code runs around 25% faster than the old,and more importantly around 6% faster than the Pure MPI code (see Figure 3.7). Note that this cannotbe taken as a gold-star win for the new Mixed version, as a quick look at the communication timesbetween the old and new codes shows that both the Point-to-Point and Collective sections ran fasterwith the new. Since no alterations to the comm pattern have taken place, it must be assumed that someof the improvement seen in the new Mixed results is in fact due to the system being less heavily loadedwhen the new results were taken.

However, all three computation sections of the Mixed code show improvement, in particular the Algo-rithm section which demonstrates a 34% increase in performance. Since it was the intention to improvethe computation sections by improving the underlying use of OpenMP in the kernel, the modificationsto the Mixed code can be considered a success.

The overall characteristics of the Mixed code appear similar to the old version, with the fastest overallruntime coming from the 4×1 process decomposition mainly due to a saving in the MPI communicationtimes. Indeed, there is stronger evidence with the new version that the Algorithm and Update sectionsactually run slower with this process topology, possibly due to the fact that all data/work distribution isnow taking place along the M axis of the 2D problem.

35

Figure 3.14: Timer data for Version 2 Mixed (top) and Version 1 Mixed (bottom) runs performed on 4LPARs. Horizontal axis displays the given process layout in 2D.

36

As a final test of the new Mixed version’s OpenMP efficiency, a run on 4 LPARs with only 1 thread perprocess was again performed, in order to see how much threaded-sections’ overhead was contributingto the Mixed’s runtime. This graph is shown in Figure 3.15.

Figure 3.15: Timer data for a Version 2 Mixed run performed on 4 LPARs with 1 thread per process.Horizontal axis displays the given process layout in 2D.

Comparing this graph to Figures 3.7 and 3.11, we can now see that thread overhead has been completelyeliminated in the Algorithm section, for any of the reasons suggested in the Pure OpenMP Studiessection above. The Delta and Update loops still run slightly slower in Mixed, but the new versionhas managed to improve things somewhat. However, the Algorithm saving is the biggest performancebenefit from version 2, giving new confidence to the Mixed’s ability to outperform the Pure MPI.

Given that the only code to have demonstrated problems with level 4 optimisation had by this stage beendropped (version 3 of the OpenMP code), repeated runs for the Pure MPI and Mixed codes (both 8 and1 thread(s)) were made under -O4 on 4 LPARs. However, the results from this run showed that -O4had no overall effect on any of the total runtimes; some sections of the codes did improve somewhat, butothers slowed down to give a net effect of zero. Therefore, the compiler optimisation was left at -O3for all further Jacobi studies.

This section has highlighted one problem with the data presented – HPCx itself. Reproducibility oftimer results on the Service can often be somewhat variable, particularly when it comes to recordingsof communication times across the interconnect. It is impossible to predict the current pressure on thecommunication hardware without having details of every code running on the system, and since it isimpractical to reserve the entire machine for studies such as these there is no choice but to bite the bulletand accept some variation in comm time between runs.

37

With the computation sections improved somewhat, and the Collective section running faster with theMixed code as expected, there still remains the puzzle of why the Point-to-Point communications exhibitsuch a slowdown when switching from lots of small messages in the MPI model to a few larger ones inthe Mixed. With the computation now running at almost the same speeds between the different parallelimplementations, and the Collectives faster already with Mixed, if the Point-to-Point comms could alsobe improved then the Mixed code would come out the clear winner. This question will be answered inthe next section.

3.4.2 Scaling Problem Size

The second problem chosen for experimentation was set to have M ×N equal to 450× 450 per proces-sor, a local problem size of approximately 2.43 MB. This problem was set to scale up as the number ofprocessors was increased, whilst keeping the overall problem geometry as square as possible; for exam-ple, when sixteen processors (2 LPARs) were used, the total M ×N problem size was 450∗4×450∗4.This particular size was chosen because it exceeds the L2 cache per processor (720 KB), but will fitcomfortably into the L3 per processor (16 MB). This is quite representative of typical scientific applica-tions. Note that in a world of perfectly scaling code, all runs with such a problem should take exactly thesame amount of time to execute, irrespective of either the parallel implementation chosen or the numberof processors allocated.

The code was fixed to run for 5000 iterations, by setting the convergence tolerance to an extremely smallvalue of 1 × 10−6; this ensures again that the same amount of work is being done for all runs.

This set of runs forms the core of the Jacobi section of the project, as its primary purpose is to comparethe performance of the Mixed code against the MPI. Note that in all cases, version 2 of the Mixed codeand version 2 of the Pure OpenMP code were used for the analysis. In addition to the full code tests,runs were also performed with the Delta and Collective sections switched off, in order to test both thelower level of MPI synchronisation present and the effectiveness of the Mixed code when its primarysource of performance gain is removed.

Note that a second set of experiments was also run, which scaled with a problem of 230 × 230 perprocessor (a local size of approximately 635 KB). This problem was chosen to fit into L2, and thenumerical data is presented in the Appendix (A.1.3). However, the data obtained shows almost identicaltrends to the L3 scaling problem, and so will not be presented graphically or discussed separately.

The results are presented first in order of increasing processor numbers, subdivided into Small (1 LPARincluding Pure OpenMP), Medium (2 and 4 LPARs), and Large (8 and 16 LPARs). A more in depthstudy of some specific features follows later.

Small

For 1 LPAR runs, it was possible to test all three parallel codes at the same time. Graphs of the results forthe OpenMP, Mixed, and MPI codes are shown in Figures 3.16, 3.17, and 3.18 respectively. Note thatfor ease of reference, the results for both Collectives On and Collectives Off are displayed on the samegraph for each parallel code, and are twinned with each other based on process decompositions/threadnumbers; the convention used is that the Collectives On results are placed on the left. All numerical

38

data, including standard deviations for each code section, are included in the Appendix (A.1.2).

Figure 3.16: Timer data for OpenMP for the L3 Scaling Problem Size on 1 LPAR. Horizontal axisdisplays the number of threads used.

Turning first of all to the Collectives On data, the Pure OpenMP code stands out as the clear winnerin terms of performance, with an average runtime approximately 17% faster than either the Mixed orMPI codes. Only the Update section of the code runs slower compared to the other implementations,but this is more than made up for by the sizeable gains in the other two sections. Again, considering thenecessary folding-in of the Point-to-Point and Collective sections in the Pure OpenMP code, this doesseem to be a good indication that shared memory use on-node is the way to go.

However, this is not borne out in the Mixed code’s results. The total runtime for the Mixed and best (4×2) MPI code overlap within one standard deviation, meaning that they essentially run at the same speed.This would be fine were it not for the fact that the (1 process; 8 thread) Mixed code is performing no MPIcommunication at all on 1 LPAR (recorded time in the Point-to-Point and Collective sections is mostlikely generated from the single MPI process checking that it has no other processes to communicatewith). This means that the Mixed code has slowed down from the OpenMP time to the MPI time solelythrough problems with the computational sections.

This is initially appears to be quite puzzling. The computation sections of the Mixed and Pure OpenMPcodes are implemented in exactly the same way (as discussed in Section 3.4.1), and whilst the Updatesections of the two codes run at the same speed within experimental error, the Algorithm and Deltaloops display performance drops of 25% and 23% respectively when moving to the Mixed code. Onepossible reason could again be related to compiler optimisations. The one difference between the twoversions is that the upper loop bounds in the computation sections are set by the #defines M and N

39

Figure 3.17: Timer data for Mixed (1 process; 8 threads) for the L3 Scaling Problem Size on 1 LPAR.Horizontal axis displays the given MPI process layout in 2D.

in the OpenMP code, but in the Mixed they are governed by the int variables MP and NP . Thereforethe -O3 optimisations may produce faster code when the upper loop bounds are known at compile time,as is the case with the Pure OpenMP. However, this is still a sizeable amount of performance gain to beobtained from this difference, so it does appear that another effect is at least partly responsible.

This problem with the computational sections is also evident when we examine the MPI data as well.Since the Mixed (1 process; 8 threads) code has no MPI communication at all, it seems logical toassume that it would run slightly faster than the Pure MPI code given the apparent superiority of sharedmemory communications as seen with the Pure OpenMP. However, again, the Mixed code is broughtlevel with the MPI code based on a slowdown in its computation time. Comparing the Mixed with thebest MPI, we see that whilst the Algorithm runs 1% faster with Mixed, the Delta and Update slow downby 10% and 20% respectively. This drop in performance appears to be too large to be due solely tointer-thread communication. Recalling that this version of the Mixed code was chosen specifically toreduce the overhead from the computation sections, and that the implementation is virtually identical tothe OpenMP, this drop in performance appears difficult to rationalise.

Regarding the Pure MPI code, we see that the best performance comes from the 4 × 2 process decom-position. Interestingly, this is due not to the Collective section (which appears insensitive to the de-composition choice within experimental error), nor from the Point-to-Point comms (which again favourthe fully-contiguous sends of an 8 × 1 decomposition), but rather from performance gains made in allthree computation sections with this process layout. This may be due to a feature of the global problemtopology, since within 1 LPAR the total size is given by 450 ∗ 4 × 450 ∗ 2; hence it appears that the

40

Figure 3.18: Timer data for MPI for the L3 Scaling Problem Size on 1 LPAR. Horizontal axis displaysthe given MPI process layout in 2D.

MPI is favouring the process topology that maps most closely to the overall geometry. This perhapssuggests that the resulting square double-loops are the most compatible with compiler optimisations forthis problem, although such an explanation does appear unlikely.

With the collective routines switched off, the picture appears slightly different. The OpenMP is againthe winner in terms of performance, with the Mixed code still showing the unexplained slowdown incomputation time compared to the others. Due to a sudden jump in the standard deviations of the MPIdata, it is not possible to make a clear judgement as to which decomposition topology is now favoured,as all of the total runtimes overlap to some degree. The rise in error with the computation may be dueto the removal of the implied barrier built into the Allreduce, but the rise in communication time ismore likely due to increased traffic on the system.

In both the Mixed and OpenMP codes, the Update section now takes considerably longer with theCollectives section turned off; this is likely due to the Update section now being responsible for syn-chronising the threads due to the nowait on Algorithm, whereas before this was Delta’s responsibility.Interestingly, the Algorithm section improves with collectives off in the OpenMP code, but gets slightlyworse in the Mixed code. Given the difficulty in determining the performance characteristics of theMixed’s computation sections, this feature is rather difficult to analyse further.

Overall, the 1 LPAR results paint a depressing picture for the Mixed code, as it appears unable to keepup with its OpenMP counterpart despite the similarity between their computation operations. If it could,it would clearly outperform the Pure MPI code at this level, and might stand a chance of keeping that

41

lead as we move to larger numbers of LPARs. However, since no more Pure OpenMP studies can bemade beyond 1 LPAR, it remains to be seen whether any additional factors will start to favour the Mixedcode.

Medium

Graphs of the Mixed and MPI performance figures for 2 and 4 LPARs are shown in Figures 3.19 and3.20; tabulated data is presented in the Appendix (A.1.2). Again, both collectives on and collectives offruns are displayed on the same graphs, matched by process decomposition. Note that for the 4 LPARrun, Pure MPI decompositions of 16 × 2 and 2 × 16 were not tested as it was felt that they would giveperformance figures that would be similar to the 1D decompositions.

For the Collectives On runs:

• The best Mixed decompositions are 2 × 1 and 4 × 1, with the performance improvement comingalmost entirely from a reduction in Point-to-Point time due to all-contiguous sends. The compu-tation sections are less well defined, with no clear trend for a favoured decomposition emerging.The Collective section continues to remain insensitive to the topology chosen.

• The best MPI decompositions are harder to distinguish, with 2 × 8 only fractionally slower than16 × 1, and 8 × 4 coming out ahead for the 4 LPAR numbers but carrying a comparatively largeerror. For individual sections, the Point-to-Point shows the now familiar trend, but interestinglythe Collective Section now appears to be demonstrating a growing sensitivity to the process topol-ogy as well. Given the errors on these numbers (which are not unexpected since the Allreducefunction is the part of the code most sensitive to traffic on the system), it is impossible to makedefinite claims; however, there does appear to be some leaning towards a process topology thatmatches the global problem geometry. Why this is occurring is not clear; one possibility is thatthe Tree-algorithm used inside the global communication is able to build up a faster communica-tion pattern between processors when they are arranged in blocks on their respective LPARs (as asquare-like decomposition) as opposed to strips (as 1D decompositions).

The computation sections are inconsistent across the two data sets. The 2 LPAR run attributes thebest performance to the 2× 8 decomposition, which does not match the overall problem shape orgive longer internal loops for better unrolling (the only reason 16 × 1 is slightly faster is becauseof the sizeable gain in Point-to-Point time). However, the performance differences are quite smallwhen comparing sections, and when we move out to 4 LPARs the same decomposition choice(i.e. the one matching the problem geometry) as was favoured with the 1 LPAR runs is beingpicked out again here; the same reasoning for this therefore still stands.

• Comparing the two parallel codes, yet again the best Mixed is just slightly slower than the bestPure MPI. However, the situation has now grown worse: in addition to the computation sectionsrunning slower in the Mixed code, the still-unexplained phenomena of the Mixed Point-to-Pointsection running slower than the Pure MPI has reappeared again, despite both Mixed and MPImessages still lying well below the Eager Limit. This means that Mixed code has been reducedto providing a performance increase over the Pure MPI code in only one section of the kernel –the Collective communications – seemingly in spite of the improvements made to the code in theprevious Section (3.4.1).

42

Figure 3.19: Timer data for Mixed (top) and MPI (bottom) runs on 2 LPARs. Horizontal axis displaysthe given MPI process layout in 2D.

43


44

Looking now to the Collectives Off results, an essentially identical picture emerges for both codes. Thesame process decompositions are favoured, and for the 2 LPAR run the Mixed code continues to runslower than the Pure MPI; this is hardly surprising, given that the Mixed code’s only advantage lies inthe Collective section. The same trends in the Point-to-Point and computation sections are also againbroadly seen.

Some care must be taken when analysing the comparative performance on 4 LPARs with the CollectivesOff. Whilst on a first glance it appears that the Mixed code has actually managed to pull ahead of thePure MPI, attention must be drawn to the Point-to-Point times recorded for the Pure MPI code. All areconsiderably higher than was seen for the Collectives On run, and the most likely explanation for thisis simply increased traffic on the interconnect when this data was gathered. If the communication timeswere “reset” to the values gathered for the On run, the Mixed code would be clearly lagging once again.

As a final aside to this section, the actual scalability of the Jacobi code appears to be quite good overall,with the 1 LPAR to 4 LPAR runtimes only increasing by around 10%.

Large

Results for 8 and 16 LPAR runs can be found in Figures 3.21 and 3.22, with tabulated data presentedin the Appendix (A.1.2). As usual, Collectives On and Off are displayed on the same graph. Note thatthe 16 LPAR run was only run for 2000 iterations, in order to reduce the runtime and hence make thedata “cheaper” to gather on HPCx; the data recorded has been scaled up to be comparable with a 5000iteration run by simply multiplying all values by 2.5, and this scaled data is what can be found in Figure3.22 and the Appendix. Also note that only a restricted set of all possible process decompositions forboth Pure MPI runs was tested, as it was felt that many would simply give similar characteristics totopologies already under consideration.

For these data sets, the Mixed and MPI times are now too close to call. For both On and Off runs,and on both 8 and 16 LPARs, the best Mixed runtimes are in fact slightly faster than the best MPI;however, the errors present on these values are considerably larger than the difference between them.This does suggest that the Mixed code is catching up though, due to the ever-more significant reductionin collective communication time.

Once again, the same broad characteristics are true. The best Mixed decompositions are 8 × 1 and16 × 1, due mainly to the faster Point-to-Point, as before. The Mixed computation sections continueto show no real trend towards a favoured topology; but with the increased number of MPI processesin play, the Mixed Collective section is starting to demonstrate the same preference for a match to theproblem geometry as seen with the Medium Pure MPI data.

The best MPI decompositions are 8×8 and 128×1, although factoring in the errors present this cannot bestated conclusively. These assignments give the best computation times for 8 and 16 LPARs respectively,which is surprising for the latter since it does not match the global problem shape. However, comparingthe 128×1 numbers to the 16×8, the recorded times overlap within one standard deviation which againmakes any firm conclusions difficult. Point-to-Point and Collective trends are the same as they were forthe Medium data sets.

One interesting feature is present in this data: the On/Off relationship with the Point-to-Point section.With Collectives On, the Point-to-Point section shows the same behaviour as in all previous cases,

45


46


47

with the Mixed times still running slower than the MPI. However with the Collectives switched Off,the Mixed comm times actually run faster than the MPI despite both codes’ time spent in this sectionincreasing. Unfortunately, this is not good news for the Mixed code. Considering the data gatheredover all the LPAR runs from the Small and Medium runs in addition to this, it appears that with theremoval of the barrier in the Off data, the Sendrecvs in the Point-to-Point are now taking up some ofthe synchronisation slack for the processes. One would therefore expect that the MPI code to run slowerin this section, since it has considerably more processes to engage in partial pairwise-synchronisation.This does however mean that the earlier Off data for the Pure MPI run may not have been due to systemtraffic as was first thought, but there was insufficient evidence at that point to make a conjecture such asthis.

Overall, the same picture has emerged again. The Mixed code only makes real performance gains inits Collective section (the Off Point-to-Point times for this many processors notwithstanding), with itscomputation sections and the properly-synchronised Point-to-Point all showing an increase in runtimecompared with the Pure MPI code. The next two sections are intended to explain these trends.

48

Point-to-Point Communication Study

In an effort to understand the Point-to-Point behaviour between the Mixed and MPI versions, it is nec-essary to gather additional data regarding the transition between the two different parallel modes. Tothis end, a 4 LPAR Mixed run was performed first with 2 processes per LPAR and 4 threads per process,and then with 4 processes per LPAR and 2 threads per process. For both cases, only the predicted “best”process topology was used (8 × 1 and 16 × 1 respectively) based on the previous data.

These figures were then compared to the best runtimes for the original 1 process and 8 thread Mixedmodel and the Pure MPI code from the 4 LPAR studies made earlier. This data is given in the usualhistogram format in Figure 3.23, and the relevant values tabulated in the Appendix (A.1.2).

Figure 3.23: Timer data for differing process/thread combinations for the Mixed code for 4 LPAR runs,and a corresponding Pure MPI run. Horizontal axis displays the given MPI process layout in 2D, alongwith the number of threads per process used if referring to a Mixed run.

This graph clearly demonstrates that as the number of MPI processes in the Mixed code is increased,and hence the number of Point-to-Point communications taking place is increased, the time taken forthese communications to execute is decreased; since all message sizes in this study lie below the EagerLimit, this does appear to be the inverse of what one would expect.

There is still insufficient data from which to make any kind of conclusive explanation of this behaviour.In an effort to gain a better understanding of what is going on, the hpmcount utility was again em-ployed. Unlike in the previous case, when the counter was used to monitor the entire code, in thisinstance only the Point-to-Point section is of interest; therefore, the libhpm version of HPM [27] wasused to directly instrument only this section of kernel (essentially replacing the MPI timer calls with

49

HPM instrumentation calls).

HPM was used to monitor the usage of the L2 and L3 caches and main memory, as it was felt thatmemory access was the most likely cause of the performance difference. The four different data setsdescribed in Figure 3.23 were re-run with the HPM calls in place; data was recorded for each MPIprocess present in the run. Given that the actual problem resides in L3, one would expect most of thememory traffic to be occurring in L3. Some L2 usage would also be expected, but since L3 is manytimes slower in terms of access speed than L2, it is expected that L3 use would completely dominate thetime spent in this section.

Instead, we find unexpected traffic taking place through to main memory. Such accesses are consider-ably slower than L3 usage, so in fact this section is being dominated by main memory traffic. In additionto the presence of this traffic, it also appears on the MPI processes in a very distinct pattern as can beseen in Figure 3.24.

The picture is now becoming clearer. Time spent in Sendrecv calls is not just due solely to com-munication between processors, but also includes time spent gathering the data to be sent, and thenplacing this data in the correct halo after the receive. For the Pure MPI run, the load usage is mostly flat,with spikes corresponding to processes that lie at the edge of an LPAR (recall that all recorded HPMdata corresponds to 1D process decompositions) – since these processes must communicate through theinterconnect, this spike in loads may indicate the passage of the data through the switch of the LPARand then up through the shared memory subsystem and vice versa, as the comms first attempt to travelthrough the memory before realising that they must instead cross an LPAR boundary. On-LPAR com-munication is handled entirely in the shared memory for Point-to-Point, and this may account for thebase-level of memory loads on these processes.

As the Mixed model takes over, less processes are engaged in on-node communications as the threadsbegin to do more of the work; this results in the “spikes” becoming progressively more dominant untilthey eventually become the over-riding behaviour on the processes. Since main memory loads areso slow, this becomes much more of a problem with the Mixed code as the number of processes isdecreased; this rise in the average memory traffic therefore accounts for much of the slowdown in thissection.

Another fact that must be considered here is that the Mixed code only calls MPI functions on the masterthread. This means that the Mixed code used throughout the Scaling Problem Size has the data-to-communication pattern shown in Figure 3.25.

This shows that the master thread will first have to obtain the data to be sent from the cache of theprocessor running the edge-thread for the case of left/right sends, or from the caches of all the otherthreads in the case of up/down sends. This will therefore take longer than a comparable process involvedin a send, as it will already have the necessary data stored in cache. Indeed, up/down comms are evenworse under these circumstances, as the received data will be held on the master thread until each threadaccesses it as needed in the Algorithm section. For left/right comms, the master thread will “own” thecommunicated data half of the time.

So, two events are conspiring to make the Mixed code’s Point-to-Point communication time take longer.One is the large increase in memory traffic, possibly due to the MPI comms travelling first through theshared memory before edge-send/receives realise that they must travel across the switch. The secondis that the threads involved in communication must first obtain the data before it can be sent. Both

50

0 1 2 3Process ID

3e+05

4e+05

5e+05

6e+05

Mem

ory

Loads

0 2 4 6 8Process ID

1e+05

2e+05

3e+05

4e+05

Mem

ory

Loads

0 8 16Process ID

1e+05

2e+05

3e+05

4e+05

Mem

ory

Loads

0 8 16 24 32Process ID

1e+05

2e+05

3e+05

4e+05

5e+05

Mem

ory

Loads

Figure 3.24: Line graphs showing the total number of Main Memory Loads recorded on each MPIprocess for Mixed (1p x 8t) (top left), Mixed (2p x 4t) (top right), Mixed (4p x 2t) (bottom left), andMPI (bottom right) on 4 LPARs.

problems could be circumvented by redesigning the code to allow individual threads to communicateas if they were processes over LPAR boundaries by manually coding in methods for all thread IDs tobe unique and all threads to be aware of their neighbours, as this would bypass the second problemcompletely and reduce the first problem down to the Pure MPI Memory Load behaviour. However, suchcode development would be rather involved, and was hence not attempted within the time-frame of thisproject.

51

Left/right

Up/down

Mas

ter

Thr

ead

Figure 3.25: Representation of the relationship between thread data locality and the MPI communicationpattern for a Mixed code. The shaded area indicates the data halos on the process.

Computation Sections Study

With one part of the Mixed mode’s performance characteristics explained, we now turn our attentionto the second problem: the computation sections. In a similar manner to the method employed in thePoint-to-Point study described above, each of the three sections had their timer calls replaced with HPMinstrumentation calls. These were designed to output hardware usage data for every process in a PureMPI run, or every thread in a Mixed run (since Mixed processes do not perform any computation).Given that these sections deal with array operations, it was again decided to use HPM to monitor theL2, L3, and main memory behaviour. This data has not been provided in the Appendix, simply due tothe quantity of it.

The results obtained were rather confusing. HPM reported that each computation section of the kernelspent some considerable time loading from main memory, despite the fact that the entire problem wasdesigned to fit into L3 per processor with room to spare. There were more loads taking place whenthreads were in use, and since memory loads are so expensive in terms of time this single difference inmemory behaviour is the reason behind the Mixed code’s computation taking longer.

This answered the question, but raised far more. The code should not be accessing main memory at allin these sections. Nowhere is that more clear than in the Update section, where the threads/processesshould only be overwriting the values in old with the values in new; since both arrays fit in cache atthe same time (along with the third array, edge), all the memory traffic related to this operation shouldclearly be taking place in L3. Instead, about 15% of the loads for the processes, up to about 25% of the

52

loads for the threads, are going to main memory. This equates to hundreds of loads from memory perthread/process per iteration of the kernel loops.

The situation becomes even more confusing when the other scaling problem size is considered. Recallthat a second scaling problem size was being run alongside the L3 data, only this problem was designedto fit in the L2 per processor. An HPM instrumented run of this problem size showed the computationsections taking 10% of the loads from L3 for processes and 15% for threads, and a further 5% still goingthrough to main memory for both. This behaviour is simply bizarre, as the L2-fitting problem is tiny incomparison to the size of the L3 per processor and hence no justification can be made of loads still goingthrough to main memory. The actual numbers themselves are not insignificant either, with thousands ofloads from memory per thread/process taking place per section (equating to about one per iteration perthread/process).

In order to fully investigate this cache behaviour, a code was written which simply declared two 1Dfloat arrays, filled them with random data, and then added each element together. The loop thatperformed this addition was then instrumented with HPM, and the code run on a single processor withvarying total array sizes. It was compiled with the same options as the Jacobi code, in order to keepeverything consistent. The results are given in Table 3.3.

Total Problem Size L2 Loads L3 Loads Memory Loads200 KB 15708880 10 0

1200 KB 91757819 1753632 53898210 MB 701858669 68308725 17535457

100 MB 6917507218 528641734 432049931

Table 3.3: HPM Cache/Memory data obtained from the simple array-addition code, for varying totalproblem sizes.

These problem sizes were chosen to fit comfortably into L2 (200 KB), fill most of L2 (1200 KB), fitcomfortably into L3 (10 MB) and fill L3 (100 MB) – recall that this code only uses one processor on anLPAR, but gets the entire LPAR to itself hence 1440 KB of L2 and all 128 MB of L3 are available.

This HPM data clearly shows that the only problem size that has the expected behaviour is the smallest.When the data has expanded to fill L2, over a million rogue loads from L3 and hundreds of thousandsfrom memory are now taking place. As the problem grows into L3 the situation gets even worse, and bythe time L3 is mostly full the number of loads going to L3 and main memory is roughly 50/50.

This “Cache Leak”, seemingly present in both the L2 and L3 caches, is fundamentally a hardwareproblem; however the leakage is more apparent with OpenMP threads than with MPI processes as wasseen with the Jacobi code. This means that the computation sections of the Mixed code are beingunfairly hit due to a problem with the HPCx chip/MCM design, which has really nothing to do with theMixed code itself. However, there does not appear to be any clear way of circumventing this problem inthe software.

Based on email correspondence with IBM, part of this Cache Leak problem can be explained: the L3cache can choose not to retain new data if it is already highly utilised - it instead acts as more of a bufferfor the main memory. This partially explains why some of the main memory traffic is taking place, butit does not explain why the effect should be so evident with problem sizes that leave plenty of space inL3, nor why it shows up in L2 use as well.

53

3.4.3 Summary

Overall, the results from the MPI versus Mixed studies show a fairly poor picture in terms of the Mixedcode’s performance. The Mixed code runs with faster collective communications, due in part to the factthat there are less processes involved in the MPI call, and also because part of the collective operation isincluded in the Delta loop. However, all of the computation sections run slower, due to the inclusion ofthe thread’s shared memory communications and also from the Cache Leak identified above. In addition,the Mixed Point-to-Point communications are slower than the Pure MPI, due to threads having to re-cache data after masteronly MPI calls and from cross-LPAR comm traffic now being more apparent.

However, the Point-to-Point could be made to run faster in the Mixed code if the communication hier-archy was redesigned to allow for individual threads to communicate with MPI for cross-LPAR bound-aries; this would alleviate the re-caching issue. This may also overcome an inherent problem that theexisting Mixed code has essentially reduced the level of parallelism in the Point-to-Point communica-tions, as inter- and intra-node communication now take place at different points in the kernel; sincethe latter must now occur during computation, and this overlap is not handled efficiently with somethreads communicating whilst others perform calculations as is the case with some Mixed codes, thisreduces the performance to some degree. A Mixed design with individually communicating threadswould address this limitation.

The Mixed code is superior in terms of collective communications, as the OpenMP reduction doesappear to improve performance of this feature overall. Therefore, parallel codes which make heavyuse of the collective functions in the MPI library could benefit from employing a masteronly Mixedmode design. With an improved Point-to-Point implementation as well, it may be the case that allcommunication-dominated codes see improvements under a Mixed implementation.

54

Chapter 4

ASCI Purple Benchmarks

This chapter details the second stage of the Mixed Mode performance analysis conducted in this project,which was carried out with a suite of established benchmarks. An introduction to the benchmark suiteand the codes employed is first presented, followed by a description of the experimental methodologyadhered to. The results and performance analysis for each code are then outlined separately.

4.1 Introduction

The Accelerated Strategic Computing Initiative (ASCI) is a supercomputer development program fundedby the U.S. government. It is currently in its fourth stage of a five stage project plan with the ASCI QMachine [24]; the fifth stage in this plan will be ASCI Purple [23], which is being designed to have apeak operational speed of 100 Tflop/s.

Towards this performance goal, ASCI have released several benchmark codes which will ultimatelybe used to test the system; some of the performance deliverables will even be based on the efficiencyof specific benchmarks. These codes are free to download and use, with rigid usage guidelines onlybeing imposed when conformability with the ASCI RFP is required (i.e. when ASCI funding for codedevelopment is sought after).

These benchmark codes have therefore been selected for this project because they are already used in theindustry. In addition, given that ASCI Purple will be a Clustered SMP System, many of the benchmarksare designed with Mixed Mode functionality incorporated as a compile option. Often the parallelismis built from MPI and either OpenMP or POSIX threads, although other implementations do exist forsome codes. These codes have been released with the intention that the users attempt to optimise themfurther (and hence make ASCI’s job of obtaining the required performance figures easier), but for thisproject the focus remains on a comparison between the Mixed and MPI implementations.

55

4.2 Codes Employed

There are nine primary benchmarking codes and three smaller secondary test codes available on theASCI Purple Benchmark website:

http://www.llnl.gov/asci/purple/benchmarks/

This section describes the three codes that were used for this part of the Mixed Mode project; the onlypre-requisite for selection was that a Mixed Mode version employing MPI and OpenMP had to be avail-able. Three codes were selected as representative of covering a range of application types. It was theintention to use a fourth code – MDCASK – in addition to the chosen three, but compiler conflictsresulted in the Mixed version breaking at runtime; this was unfortunately not discovered until the datagathering stage (because the only useful test cases were for the Pure MPI and OpenMP versions), where-upon this code was discarded. All of the following information is extracted from the code Readme’sthat can be found on the above website.

4.2.1 SMG2000

This code is a parallel Semicoarsening MultiGrid (SMG) Solver, which is used to solve the linear sys-tems arising from the diffusion equation on rectangular grids; the code was set-up to solve 3D systemsusing a 27-point stencil. To determine when the solver has converged, the driver monitors a relative-residual stopping criteria, and when this value falls below a certain cut-off the code terminates. Thisresults in SMG200 being “self-checking”; aside from timer output, the only piece of data with whichthe user can check the correctness of the code is the final residual value, which should lie around 10−7.

SMG2000 is written in entirely in ISO-C, and has compiler options that allow Pure OpenMP, Pure MPI,and Mixed MPI+OpenMP versions to be built separately. The MPI parallelism is handled using datadecomposition, and the OpenMP performs work-decomposition over computationally intensive loopsat various places in the kernel files. This is very similar to the core design of the Jacobi code from theprevious chapter.

The code is described as being “highly synchronous”, with parallel efficiency determined by the size ofthe data blocks assigned in the decomposition, along with the computation and communication speedsof the machine available. The code also only performs “1-2 computations per memory access”, whichcould well be important given the discovery of the Cache Leak detailed in the Jacobi analysis.

4.2.2 UMT2K

This code solves the first-order form of the steady-state Boltzmann transport equation, and describes a3D photon transport problem for unstructured meshes (Unstructured Mesh Transport – UMT). The codegenerates these meshes at run-time in 2D and then extrudes them into the third dimension; the solutionis then calculated by tracking through the mesh in the directions of the “ordinates”, a set of associateddirections that model the angular dependence of the problem. This code produces extensive outputfiles, and provides sample output-sets for given test cases which allows for very thorough correctness-

56

checking.

UMT2K is written in Fortran90 and C, with most of the kernel computation in the latter. The compilerhas in-built options that allow for Pure OpenMP and Mixed MPI+OpenMP versions to be built – notethat a Pure MPI equivalent is obtained by running the Mixed code with one thread per process. Thetwo parallel implementations are very different: the MPI parallelism operates across the mesh, anddistributes portions of it across the processes; the OpenMP on the other hand divides up the ordinatesof the mesh portions amongst the threads, by means of a single parallel directive (a parallel for)placed across one loop in a kernel file. Hence the UMT2K code represents a rather different approachto the work/data decomposition than has been seen previously.

The UMT2K benchmark is very large, and utilises two separate in-built libraries (called METIS andSILO) in addition to its own code; this makes a detailed study of what the code is doing at any particularinstance a rather complicated affair. One note that may be important is that the code description pointsout that the “memory access patterns may vary substantially for each ordinate on a given mesh”, andgiven that ordinate decomposition is only handled by the OpenMP implementation this fact could affectthe performance of this section.

4.2.3 sPPM

This code solves a 3D gas dynamics problem on a uniform Cartesian mesh using a “simplified” versionof the Piecewise Parabolic Method (sPPM). Again, different compiler options exist for making separatePure OpenMP, Pure MPI, and Mixed MPI+OpenMP versions of the code, and test-case output files weresupplied in the installation package that allowed for thorough correctness-checking.

The sPPM code has the most importance attached to it of any of the ASCI Purple Benchmarks, asone of the goals of the design strategy is to obtain a sustained performance of 35 Tflop/s using thisbenchmark. In addition, sPPM was used as a performance deliverable for the HPCx Service during itsinitial development phase, hence making it of particular interest to this project given that HPCx was theClustered SMP used for gathering all of the performance data.

sPPM is written primarily in Fortran77 with a few routines in C. Its parallel implementations occurin the “usual” way, with MPI handling initial data decomposition and then OpenMP distributing thework of computationally intensive loops in the kernel routines as was the case for the Jacobi code andSMG2000 described above.

It was quite difficult to build up a coherent picture of the code’s subroutine hierarchy, as the source issupplied as m4macro files. The compiler then feeds these through the m4 preprocessor, and then throughanother preprocessor (cpp), before finally generating Fortran77 source files. The code description doesnote that the benchmark is heavily dominated by the computation in the kernel routines, and performscomparatively little explicit MPI communication. This therefore gives the machine usage pattern ofthis code a somewhat different feel compared to the other benchmarks and the Jacobi code, despite theunderlying parallel implementation being similar (except when compared to UMT2K).

57

4.3 Methodology

This section covers the experimental procedure adopted and machine employed for assembling the per-formance data on the chosen ASCI Purple codes.

4.3.1 Hardware and Software

Again, the HPCx Service [21] was used to gather all of the data for this section of the project. Therefore,all of the earlier descriptions on the machine hardware and software implementations detailed for theJacobi Code (Section 3.3) continue to hold true here.

The only differences for the ASCI Purple codes arise from the compiler options employed. First, sinceUMT2K and sPPM use Fortran90 and Fortran77 respectively, it was necessary for some compilation tobe performed with the IBM Fortran compiler. Currently, version 8.1.1.0 of the XL Fortran compiler isavailable on HPCx, aliased to xlf for f77 compilation and xlf90 for f90. As is the case for the Ccompiler, MPI is included at compilation with the alternate aliases mpxlf_r and mpxlf90_r withthe r indicating thread-safe code.

The three principal compiler optimisations (with either C or Fortran) remain: -O3 -qarch=pwr4-qtune=pwr4; their functions are the same for these codes as for the Jacobi benchmark. Note that-q64 had to be dropped from all of the ASCI Purple benchmarks, as it caused either compilation orruntime errors with all of the codes. However, imposing -O3 on all codes ensures the same standardof serial optimisation has been enforced throughout the project; the option applies roughly the sameoptimisations to either C or Fortran code.

In addition to these compiler flags, and obviously the OpenMP library flag where necessary, bothUMT2K and sPPM made use of additional flags that were required for correct functionality. Somewere to do with additional Fortran-only options like -qinitauto, which automatically sets all vari-ables to zero. The more important flags dealt with memory use, particularly -bmaxdatawhich assignsmore memory to a running code (used in both benchmarks), and -qautodbl=dbl4which promotesall floating-point variables to double storage (used in sPPM), which take up 8 bytes on HPCx.

4.3.2 Experimental Procedure

The process followed for gathering performance data on the three benchmark codes was not as exten-sive as that obeyed for the Jacobi code. Each code is instrumented to differing degrees: sPPM is verythoroughly monitored, with each of 13 stages per double-timestep being individually timed in additionto each double-timestep itself; by contrast, UMT2K only provides timer data on the total executiontime of the code, and one other section which appears to account for most of the runtime in any case;SMG2000 occupies the middle ground, with three internal sections being separately timed. This vari-ance in the quantity and meaning of the available data made standardising a procedure over all threecodes somewhat problematic.

In addition to the timer issue, different levels of user input are permissible for each code. Two importantpoints present themselves here: processor allocation; and input problem sizes. For the former, SMG2000and sPPM both allow the user to specify the MPI process decomposition in three dimensions, allowing

58

for a very large possible choice of runs for even a modest number of processors. However, UMT2Konly permits the total number of processes to be given which limits the potential for study. For the latterpoint, all three codes have very different resident global sizes in memory, and in some cases the precisedetails are not fully known; in addition, the input sets display varying degrees of customisability withSMG2000 being easily configurable by the user at one end of the scale, and UMT2K’s input sets beingby contrast much harder to control given their complexity.

All this means that enforcing as rigorous a standard of experimental procedure as was used for the Jacobicode was not practical given the differences in the benchmarks. Also, due to CPU allocation limits onthe HPCx Service it was not possible to gather as much raw timer data for averaging purposes. Instead,each code was initially tested with only single runs in order to get a basic feel for their performancecharacteristics; if further experimentation was then warranted, additional runs were performed later.

59

4.4 Results and Analysis

This section will present each benchmark study separately, in the order SMG2000, UMT2K, and finallysPPM. A brief summary of the results in the context of a higher-level Mixed Mode analysis will then bepresented.

4.4.1 SMG2000

The first test of this benchmark code was performed on 4 LPARs only with both Mixed and Pure MPIversions. SMG2000 reads in all of its input from the command line, allowing for a data-block-size perMPI process to be specified along with some other equation constants and the process decomposition.For this trial the block size was set to give a constant total problem size for all runs, as the blocks wereincreased for Mixed runs so that the actual per-processor size remained the same. Whilst the exactresulting size in memory could not be determined, it was known that the size per processor was largeenough to place the problem in main memory.

For the process decompositions, values were chosen that gave 1D decompositions in each dimensionfor both Mixed and MPI, and also a decomposition that was as square/cubic as possible (rotated in threedimensions). Mixed runs were always performed with 8 threads per process. Two timer values, labelledSetup and Solve, are listed as important in the benchmark readme; these values are provided in Table4.1.

Decomposition Setup (s) Solve (s)32 × 1 × 1 2.803 4.8861 × 32 × 1 1.120 2.0911 × 1 × 32 0.794 1.5774 × 4 × 2 4.235 4.5894 × 2 × 4 4.029 4.1332 × 4 × 4 3.146 3.647

Decomposition Setup (s) Solve (s)4 × 1 × 1 4.867 23.5331 × 4 × 1 3.234 18.0481 × 1 × 4 3.050 20.1242 × 2 × 1 4.124 21.7632 × 1 × 2 3.892 20.9671 × 2 × 2 3.298 18.439

Table 4.1: Initial SMG2000 run, performed on 4 LPARs with the same global problem size. Left handtable shows Pure MPI times, and right hand shows Mixed.

In addition, all runs were performed using the MPI Trace Tools available on the system [28], whichrecords the elapsed time spent in calls to MPI Library functions and hence gives a more complete pictureof the underlying communication pattern of the code. This data will be referred to where appropriate,but has not been included explicitly due to the quantity of it (records of MPI calls are collected on eachprocess).

This data shows us the following:

• The Pure MPI code appears to favour 1D decompositions over 3D, and process topologies thatfavour the z-direction. The overall times appear to be quite sensitive to the chosen process ar-rangement, with the difference between the best and worst times of the order of a factor of 5 forthe Setup phase and 3 for the Solve phase. This is likely due to the fact that over half of the totalruntime is spent in MPI calls according to the Trace data, principally the point-to-point functionsMPI_Isend, MPI_Irecv, and MPI_Waitall. The construction of the data-planes that are

60

being sent around seems therefore to favour these process topologies, probably due to the waythey are stored in memory, as we saw for the Jacobi code.

• The Mixed code again favours 1D decompositions over the 2D used in the “square” (one cannotbuild a 3D topology using only 4 processes), but now the y and z directions seem to be preferred.The runtimes are considerably less sensitive to the topology chosen with the Mixed code, withbest to worst ratios being only 1.6 for the Setup and 1.3 for the Solve, due to the much smallercontribution of the MPI comms to the overall runtime – only around 2.5 seconds is spent in Mixedpoint-to-point communications.

• Clearly, the stand-out feature of the results is that the Mixed code performs terribly when com-pared to the Pure MPI data. The best to worst ratios give a factor of 6 slowdown for the Setupphase when using the Mixed code, and a factor of 15 slowdown for the Solve phase. This has noth-ing whatsoever to do with the underlying communications, as the Mixed code performs about atenth as much communication as the Pure MPI code, and with all messages lying well below theEager Limit this results in the Mixed comms actually running faster.

Instead, the problem is tied up with the computations taking place in the code. As the Readmementions, the SMG2000 code only performs 1-2 computations per memory access so it is possiblethat the Cache Leak effect seen with the Jacobi code is again in play here, as that had a greatereffect with threaded code. However, a factor of 15 slowdown is an enormous performance hitto be attributed solely to this effect, particularly given that the differences in the Jacobi runtimeswere nowhere near this severe. Indeed, it actually appears as if the OpenMP is doing nothing atall and instead the 4 MPI processes are trying to solve the problem on their own – this wouldobviously result in slower runtimes.

One possible explanation as to why the Mixed version runs so slow is that the OpenMP does not scaleat all, due to some poorly designed feature of the code. The SMG2000 source itself was thereforethe next approach, as a search of the available literature indicated that the benchmark had only beenused in its Pure MPI form for other studies (it has a rather desirable communication pattern as an MPIcode as discussed in Vetter and Yoo [15]). Investigation led to the discovery that all OpenMP use wasperformed simply with parallel for directives, which were used multiple times throughout thecode across many different loops. Profiled runs indicated that the underlying behaviour of the codechanged dramatically with the inclusion of OpenMP, with two threaded routines called SMGResidualand CyclicReduction increasing their percentage share of the overall runtime by a considerable fraction.In addition, the OpenMP introduced a completely new routine called Thdcode, which accounted for afurther 15% of the total runtime.

The most telling result from the profiles was the appearance of the OpenMP function_xlsmp_DynamicChunkCall. This was called by all the thread-outlined functions, and is a clearindication that the OpenMP is actually running inside the code. One final study, based on the source codeafter all macro-expansion had taken place, suggested that a zero-scaling implementation was indeed thesource of the problem, as the OpenMP appeared to be operating at far too fine-grained a level. Thisresults in overhead from the OpenMP library completely overwhelming any possible speedup, and henceSMG2000 was not explored any further as it was clear that its Mixed version would never be able tocompete with the Pure MPI at any level.

61

4.4.2 UMT2K

The first test of this benchmark was performed on 4 LPARs only, for both Mixed and MPI versions(although recall that an MPI run of UMT2K is in fact a Mixed run with 1 thread per process, as noseparate build option is available). The same input files were used for both versions. Note that there wasno indication as to how big this problem was in memory, but given how long the problem took to run(see below), it is probably a fair assumption that it resides in main memory.

No process topologies can be specified for UMT2K, so only the total number of MPI processes is re-quired. The Mixed run was performed with 8 threads per process as standard. The benchmark readmeindicates two reported times that are of interest, Wallclock and Angle-Loop-Only, and these are bothreported; note that UMT2K provides its timer output in minutes rather than seconds. This data is pre-sented in Figure 4.1, with the raw data tabulated in the Appendix (A.2.1). As before with the SMG2000code, these runs were also performed using the MPI Trace Tool.

Figure 4.1: Results of a 4 LPAR run for UMT2K with MPI (left) and Mixed (right) versions.

These results indicate that once again the Mixed version of the benchmark runs slower than the equiv-alent Pure MPI. Both the total execution time (the Wallclock figure) and the primary computation time(the Angle-Loop-Only, which refers to the total time spent in the loop which decomposes the angularordinates i.e. the only place in the code that uses OpenMP) both show a performance drop when movingto the Mixed code.

According to the MPI Trace files, very little of the total runtime comes from communication, averagingaround 9 seconds in the Pure MPI, and around 11 seconds in the Mixed code. This difference is dueentirely to calls made to MPI Barrier; both parallel implementations make the same number of calls to

62

the routine, but the Mixed code spends longer doing it. This suggests that the Mixed’s workload is lesswell distributed than the Pure MPI for some reason, but it cannot really be explained at this stage.

It appears that the computation of the Mixed code is the principal culprit behind the performance loss,potentially due to poorly scaling OpenMP once again. To examine the effectiveness of the OpenMPimplementation, a single LPAR run was performed in order to evaluate the Pure OpenMP version ofUMT2K. Runs for MPI on 8 processes and Mixed on 1 process with 8 threads and 2 processes with 4threads per process were also obtained. This data is presented in Figure 4.2 and in the Appendix.

Figure 4.2: Results of a 1 LPAR run for UMT2K with OpenMP (left), Mixed (centre), and MPI (right).

Again, we see the OpenMP and 1p × 8t Mixed codes suffer the most in terms of performance, with theMixed code running slightly slower due to the overhead from the MPI (according to MPI Trace). As thenumber of processes in the Mixed code is increased, the performance improves, and with the OpenMPreduced to only one thread per process (a “Pure” MPI run) the best performance is seen; overall, the1p × 8t Mixed code runs around 16% slower than the MPI.

This strongly suggests that the Mixed code is suffering due to the scalability of the OpenMP loop in thekernel (recall that the OpenMP use in the UMT2K code amounts to a single parallel for directiveacross one loop in a kernel file). In order to investigate this further, additional timer calls were placedin the file containing the OpenMP loop – called snflwxyz.c – designed to record the amount of timespent in a single run of this loop for each MPI process; the routine contained in this file was calledmultiple times during a run of UMT2K, and presumably the Angle-Loop-Only value already calculatedthe total amount of time spent in this loop for a complete run.

Note that this extra timer call was unable to record loop times in the Pure OpenMP version of UMT2K,

63

because the terminal output from the timer was lost for such runs. This was due to an internal conflictregarding the redirection of STDOUT for the rank 0 process, and could not be resolved. However, thetimer functioned as required for Mixed (and hence one-thread MPI) runs of the code.

The altered code was then re-run for MPI, OpenMP, and Mixed versions to determine both the amountof time spent in the OpenMP loop and also the scalability of the three versions. Runs were performedon 1 LPAR for 1, 2, 4, and 8 processors, with the Mixed code set to always run with 1 MPI processbut with increasing numbers of threads. This was performed using the same input data set as before,but with a single parameter called tmax reduced in order to reduce the total runtime: this was done inorder to obtain manageable runtimes for 1 processor runs. The code’s own timer data for these runs isshown in Figures 4.3, 4.4, and 4.5 with tabulated results in the Appendix; the additional loop times periteration for the Mixed and MPI runs are given in Table 4.2.

No. Processors Time (min)1 0.1542 0.0794 0.0408 0.022

No. Processors Time (min)1 0.1572 0.0834 0.0448 0.026

Table 4.2: Time spent in the OpenMP loop per iteration per process. Left table is for MPI, and right isfor Mixed.

Figure 4.3: Results of a 1 LPAR run for UMT2K with MPI for 1, 2, 4, and 8 processors with one processper processor.

64

Figure 4.4: Results of a 1 LPAR run for UMT2K with Mixed for 1, 2, 4, and 8 processors with oneprocess per LPAR and 1 thread per processor.

A number of points can be extracted from this data:

• From Table 4.2, we can see the gradual decrease in relative efficiency between the MPI and Mixedversions in terms of processes versus threads. For the one processor case, the two versions’ loopsrun at approximately the same speed. However, for 2 processors the two-thread Mixed versionof the loop runs 5% slower than the corresponding two-process MPI version. This becomes 10%for 4 processors, and finally 18% for 8 processors, which is comparable to the 16% slowdownseen between the 1p×8t Mixed and 8-process MPI codes seen earlier. This demonstrates that thethreaded version of this loop runs slower than the MPI, which suggests that the OpenMP simplydoes not scale as well as the MPI for the UMT2K benchmark.

• The scalability of the three versions of the code appears to be quite good overall. The 8 processMPI code has a speedup factor of 5.4, whilst the 8-thread Mixed and OpenMP codes achievefactors of 5.4 and 5.3 respectively. In fact, the results of the Mixed and OpenMP codes areconsiderably better than expected given the previous results (and the per iteration times of theOpenMP loop discussed above).

However, the reason for this apparent improvement is straightforward: the threaded codes getiteratively worse, meaning that the longer they run for the more apparent the performance dropbecomes. Resetting the value of tmax (which controls how many iterations of the OpenMP-kernel loop are performed amongst other things) to the value used earlier results in the relativeperformance of the threaded versions dropping in comparison to the MPI code, as the effects

65

Figure 4.5: Results of a 1 LPAR run for UMT2K with OpenMP for 1, 2, 4, and 8 processors with onethread per processor.

of the poorly scaling OpenMP loop are felt for longer. Indeed, with the larger value of tmaxin place, the speedup of the MPI code remains at 5.4 for the 8 process run, but the Mixed andOpenMP factors fall to 4.6 and 4.7 respectively.

This is clearly a serious problem for the Mixed (and OpenMP) versions of UMT2K. The problemset chosen for this benchmark study was a considerably reduced version of a standard test case;typical runs of the UMT2K code in fact take around 30 minutes on 8 processors for an MPI run.This means that for a “normal” UMT2K run, the iterative slowdown of the OpenMP loop will bemuch more pronounced given the longer runtimes involved, and the performance of the Mixedcode will suffer considerably.

• An interesting anomaly appears in the 4 processor results, as the OpenMP and Mixed codes bothoutperform the MPI by around 9% in terms of overall runtime. An investigation of the MPI Tracefiles showed a similar spike in time spent in MPI Barrier as was seen for the 4 LPAR Mixed runconducted at the start of this study.

In order to investigate this further, MPI runs on 4, 8, 16, and 32 processes were performed andthe time spent in MPI Barrier calls examined. Using the metric:

average time spent in an MPI Barrier per processtotal runtime

the 4 process run scored 0.057, whilst the 8, 16, and 32 process runs scored 0.045, 0.081, and0.094 respectively. This suggests that the mesh created for a 4 process decomposition is less well-

66

balanced than the others, as one would expect this metric to simply show a steady increase as thenumber of processes is increased. This also explains why the 4 process Mixed run at the start ofthis study showed a slowdown in MPI communication time compared to the 32 process MPI run– the underlying MPI decomposition was simply unbalanced.

A fairly clear picture of the performance characteristics of the UMT2K code is now available. Apartfrom the rogue unbalanced mesh decomposition obtained for 4 processes, the MPI code always outper-forms the Mixed or OpenMP codes simply due to the fact that the OpenMP does not scale particularlywell. As a final confirmation of this hypothesis, runs of the Mixed and MPI versions were performedfor 2, 4, and 8 LPARs in order to build up a better picture of the codes’ performance on larger numbersof processors. These results are presented in Figure 4.6, and tabulated in the Appendix as normal.

Figure 4.6: Results of a 2, 4, and 8 LPAR runs for UMT2K with Mixed (one process per LPAR and 8threads per process) and MPI (8 processes per LPAR) versions.

These results paint the same picture as before, with the Mixed code consistently outperformed by theMPI for all runs. The 4 process Mixed run does not seem to be particularly affected by the poor meshdecomposition compared to the overall Mixed-to-MPI trends, but this is simply because the UMT2Kcode spends much more of its time performing computations within the poorly scaling OpenMP Angleloop than it does in MPI communications.

As noted in the code description earlier, the Benchmark Readme points out that memory access patternsvary significantly between ordinates, and since the OpenMP work decomposition takes place over theordinates this may be the inherent limitation in the scalability. The mesh decomposition in the MPI codemay reduce this effect on account of it getting smaller mesh portions than with the Mixed code, whichmay affect either the number or arrangement of ordinates per mesh unit. However, without a detailed

67

analysis of the UMT2K source, it is difficult to draw any firm conclusions about this.

These results demonstrate that there is no benefit in running the UMT2K code under a Mixed parallelimplementation. Indeed, given that the MPI runs performed in this section were in fact Mixed runs withone thread per process, it is likely that a genuine Pure MPI run would show even better performancewith the overhead of the threaded section removed.

If it were possible to make use of a more sophisticated OpenMP implementation in the kernel, as theinclusion of a single parallel for directive is a rather basic development of a Mixed implementa-tion, then it might be possible to generate threaded code that scales better than the corresponding MPI.However, such code design alterations lie outside the scope of this project given the size and complexityof the UMT2K benchmark.

4.4.3 sPPM

As for the previous two ASCI Purple benchmarks, the first test of the sPPM code was performed on 4LPARs only for both Mixed and Pure MPI versions – note that unlike the case for UMT2K, Pure MPIreally means that the code is compiled without any OpenMP threaded sections. The Mixed code wasrun in the standard way, with 1 process per LPAR and 8 threads per process. For these runs, a varietyof different process decompositions were tested: 1D topologies in all three dimensions were used forboth versions; and 3D (MPI) and 2D (Mixed - since one cannot make a 3D topology from 4 processes)decompositions that gave topologies as close to a cube (square) as possible were also tested, rotatedthrough all three dimensions.

This problem was set to give an identical tile (i.e. per process) size, and the same overall problemsize – this means that the tile size was multiplied by 2 in each direction when moving from MPI toMixed, since the number of processes is reduced by a factor of 8. This unfortunately meant that theglobal problem geometry was different for each run (although the total problem size remained the samefor all runs), as the global shape is determined in the code by multiplying the tile size by the processdecomposition. This means that whilst the total size in memory remained the same for all runs, the codeis actually solving a slightly different problem each time the process topology is changed. From theprogram output obtained, it appears that the total problem size is just larger than the L3 cache per node.

These results are presented in Figure 4.7, and are tabulated in the Appendix (A.2.2); these runs wereagain performed under MPI Trace. Note that only one piece of timer data, the total Wallclock runtimeof the code, is presented in these graphs; whilst sPPM outputs timer data for each of 13 separate sec-tions per double-timestep, these runs were only performed for one such double-timestep. Hence in-runcomparisons of these sections was not possible, and in terms of comparing Mixed and MPI performancethe total runtime is the most interesting figure.

From this data, the following trends can be seen:

• The Mixed implementation of sPPM favours 1D process decompositions over 2D, and accord-ing to the MPI Trace files this appears to be mostly due to a saving in the communication time.sPPM makes only a very small number of MPI communication calls (which are virtually allMPI_Isends,MPI_Irecvs, and MPI_Waits for Point-to-Point communications) per run, andthese messages are quite large, lying significantly over the Eager Limit. With a 2D decomposi-tion more messages have to be sent, which results in an increase in the amount of time spent in

68

Figure 4.7: Results of a 4 LPAR run for sPPM with Mixed (top – one process per LPAR and 8 threads perprocess) and MPI (bottom – 8 processes per LPAR) versions. The different MPI process decompositionsused are displayed on the x-axis, grouped with 1D decompositions on the left and 2D/3D on the right.

69

communications and hence an increase in the total runtime of the code.

It also appears from this data that the Mixed communication pattern favours process decomposi-tions in the z-direction. However, given that the underlying problem geometry changes with eachprocess decomposition chosen, it is difficult to make any firm conclusions.

It should be noted that this analysis of the communication differences, which is based purelyon MPI Trace files at this stage, is incomplete. The version of sPPM used underwent a coderevision after its initial release (adding MPI_Cancel calls before the MPI_Finalize), andthis alteration sometimes causes the code to fail on HPCx just before MPI_Finalize is reached.This does not affect any of the timer output sent to the terminal, and hence does not affect anyof the results presented in this section; however it did result in MPI Trace failing to produce anydata for runs that encountered this problem, as it cannot summarise the MPI use without reachingMPI_Finalize in the code. These failed MPI_Cancel calls were impossible to predict, so itwas simply necessary to make do with whatever MPI Trace data was available.

• The Pure MPI code also favours 1D process decompositions over 3D, and for the same reasons asthe Mixed version. All Pure MPI message sizes continue to lie over the Eager Limit irrespectiveof the decomposition chosen, but more messages must be sent around with a 3D topology; hence1D decompositions are faster.

This difference is more pronounced in the Pure MPI code simply because there are more processesinvolved in the communication. Each process makes roughly the same number of MPI calls for1D and 2D/3D topologies between the Mixed and Pure MPI versions, and since there are moreprocesses with the Pure MPI code the overall drop in performance is greater when moving awayfrom the 1D decomposition, as considerably more messages must be sent in comparison to theMixed code.

It is difficult to make any kind of conclusion on a preferred direction for the decomposition inthe Pure MPI code. The 1D appears to favour the x-direction and oppose the z, with the 3Ddecomposition preferring the x and z directions over the y. However, again due to the underlyingshift in global topology, it is not really possible to make a firm commitment here.

• The stand-out feature of these results is that for the first time in the project, the Mixed code showsbetter performance results than the Pure MPI. This is true across the board, with even the worstMixed process decomposition (2 × 1 × 2) running faster than the best Pure MPI decomposition(32 × 1 × 1).

It appears from the available Trace files that this speedup is due primarily to a reduction in com-munication time: the Mixed code has less processes involved in communication, so its communi-cations take less time. In addition to this, the Mixed code also appears to be sending around lessaggregate data than the Pure MPI code, although the reasons for this are not yet clear.

In order to build up a clearer picture of the communication pattern in sPPM (since this appears re-sponsible for the performance improvement seen from using Mixed Mode), it was necessary to directlyinstrument the source, since the code did not always produce MPI Trace Files. This was done in twoparts: first the actual shape of the process topology was determined by adding code that outputted therank of every neighbour for each process; second, the MPI communication calls were instrumented toprint out the source and destination ranks and the message size. These studies are considered in turn;

70

note that all further runs of sPPM used variable tile sizes in order to keep the global problem shapeconstant and hence solve the same problem for all cases.

1. To analyse the communication shape of sPPM, runs were conducted on 1 LPAR with the extraneighbour-rank output routine in place. This was done purely to determine the shape of thedecompositions used, hence no performance data is included from these runs.

This simply demonstrated that the process topologies generated were as expected. For 2, 4, and8 processes the decompositions assigned were 2 × 1 × 1, 2 × 2 × 1, and 2 × 2 × 2 respectively.These led to the creation of the 1D, 2D, and 3D arrangements shown in Figure 4.8. Note that thetopologies are periodic in the x and y directions, and non-periodic in z.

0 1

110

y

x

z 2 33

11

32

0

y

x

z

6 7

2 33

11

7

5

32

0

y

x

z

Figure 4.8: Process topologies created for a 2×1×1 decomposition (top left), a 2×2×1 decomposition(top right), and a 2 × 2 × 2 decomposition (bottom) in sPPM. Numbers shown on each individual cubeindicate the rank of each process; note that rank 4 in the 2 × 2 × 2 cube is obscured.

2. Instrumenting the actual hard-coded MPI communication calls was much harder than expected.All such calls take place in the file bdrys.f, generated from the un-preprocessed bdrys.m4source file. However, there were a large number of these calls, as there appeared to be many dif-ferent sending options possible inside the code. It cannot be the case that sPPM simply performsa basic halo-swap between processes, as the underlying communication pattern is far too complexfor that. However, without an extensive study of all the various communication subroutines it isnot possible to present a clear picture of the pattern in use.

It was decided to instrument only the MPI_Isend calls in the bdrys.m4 file, simply becausethere were less of them compared to the receive and wait calls. Runs were again performed on 1LPAR with the Mixed and Pure MPI versions in order to observe the communication pattern; theresulting output is not reproduced here due to the quantity of data obtained.

These instrumented runs demonstrated that the reduction in MPI processes when using Mixed

71

code had exactly the expected effect, with a 4 processes per LPAR and 2 threads per processMixed run performing half as much communications as an 8 process Pure MPI run, and a 2processes per LPAR and 4 threads per process Mixed run performing half as much again.

In addition, some of the Mixed communication subroutines showed the expected “half the numberof communications, double the resulting message sizes” behaviour. However, many more ofthe Mixed communications were able to send exactly the same amount of data around as thecorresponding routines in the Pure MPI code; this is a clear indication that the Mixed code hasbeen able to replace some of its communications with direct reads/writes to memory.

This claim drops naturally out of an examination of the process topology. Referring to Figure 4.8,and taking the 2×1×1 shape to refer to a Mixed code (on 1 LPAR with 2 processes and 4 threadsper process say) and the 2 × 2 × 2 shape to be an MPI code (on 1 LPAR with 8 processes say),we see that for the same global problem size the Mixed data “cubes” must be larger than the PureMPI. This means that where in the Pure MPI code we had process boundaries and hence explicitMPI communications, we now have same-process data and hence direct reads/writes inside thedata-cubes in the Mixed code.

One further examination of the sPPM code is necessary to complete the picture. Following the flowof control through the code, the primary kernel routines were identified as six functions called hydxy,hydyz etc. called from the function runhyd. These six function calls are made inside a parallelregion in the Mixed version of sPPM, and have the following pseudo-code structure:

c$omp do schedule(dynamic)do (some work)

if ((threadid .eq. 1) .and. other conditions)make MPI communication calls

endif

perform mathematical operations

enddoc$omp end do nowait

The same core kernels are present in the Pure MPI code, but the two OpenMP directives have beenremoved; in the Pure MPI code, threadid is a constant always equal to 1. This is a good exampleof the replacement of MPI calls with direct reads/writes, since with the Pure MPI code every processis involved in communication whereas in a Mixed code only the master thread on each process makesthese calls.

The do loops have nowaits attached because the barriers are forced between the six hyd functioncalls; hence synchronicity is maintained. As this pseudo-code indicates, the Mixed Mode implementa-tion chosen is still equivalent to the masteronly style employed in Jacobi code, as MPI communicationonly ever occurs on the thread with the threadid of 1 (which is the master thread in the sPPM code,as thread labels are set with OMP_GET_THREAD_NUM() +1).

However, the sPPM code is more sophisticated than that. The do loops are decomposed using dynamicscheduling, where threads are initially given a chunk-size of iterations (equal to 1 iteration in the case

72

of sPPM), and are then dynamically assigned the next chunk as they finish their current iteration space.This means that threads can perform different numbers of iterations in a single do loop, with fasterthreads being given more to do. A study of the pseudo-code therefore indicates that the sPPM kernelsare actually overlapping their communication and computations calls: since the master thread will spendlonger in one of its iterations given that it must make calls to the communication library, it will performless iterations overall and the other threads will cover the difference.

This is quite a pleasingly simple way of obtaining overlapping communication and computation, whichis normally rather complicated to implement, and can be done here because each iteration of the mainkernel loops must be independent. This was not the case with the Jacobi code, where the results of themain iteration loop were dependent on those from previous iterations.

In order to fully test the sPPM code, more extensive runs were then performed; clearly, the same perfor-mance characteristics between the parallel versions is expected. First, runs of the as-yet untested PureOpenMP version of sPPM, along with Mixed and Pure MPI, were performed on 1 LPAR. These resultsare presented in Figure 4.9, and in the Appendix (A.2.2). Note again that for these runs the tile-sizeper MPI process was altered between runs, in order to keep the global problem geometry constant andhence ensure that the same problem was being solved at all times. The problem size chosen resided inmain memory.

Figure 4.9: Results of a 1 LPAR run for sPPM with OpenMP, Mixed and MPI versions. All runs wereperformed on 8 processors: OpenMP used 8 threads; Mixed used a process decomposition of 1 × 1 × 1and 8 threads; and MPI used a process decomposition of 2 × 2 × 2.

These results demonstrate the superiority of OpenMP over MPI for the sPPM code, with the formerrunning around 45% faster. The Mixed code is performing the same work decomposition as the Pure

73

OpenMP code and simply includes the overhead of adding in the MPI calls; so from these results itcan be deduced that the overhead from this addition must be very small given the comparatively smalldifference in runtime.

This is a good indication that the performance gains seen in the Mixed code in previous runs really docome from a clever use of OpenMP in the computation kernels, rather than from a poor use of MPI.Unfortunately there is a distinct lack of corroborating data in the available literature, as it once againappears that very little use of sPPM in either Mixed or Pure OpenMP versions has been made. Stewart[18] notes that the best performance obtained from sPPM on a 16-processor SMP node was obtainedwith a Mixed run using 1 process and 16 threads, but the overwhelming amount of data gathered on thisbenchmark is restricted to studies of the Pure MPI version.

To complete the study of sPPM, runs were performed with the Mixed and Pure MPI versions on 1, 2, 4,and 8 LPARs using as cubic an MPI decomposition as possible. These results are shown in Figure 4.10,and in the Appendix as normal.

As these graphs show, the Mixed Mode version continues to dominate for all runs attempted. Theperformance is consistently around 35% better when using the Mixed version on more than 1 LPARin comparison to the Pure MPI code. However, the overall relative scalability is very good for bothversions, with the runtime roughly halving when the number of processors assigned is doubled. Thissuggests that both the Mixed and Pure MPI codes have been designed to scale very well; the Mixed codeis just fundamentally faster.

In summary, the Mixed Mode version of sPPM has three advantages over the Pure MPI version, whichleads to an increase in performance. One is that the Mixed code is able to send less messages around dueto a reduction in the number of MPI processes needed; this shows up more in the sPPM performancestudies than for any previous code because all sPPM messages lie above the Eager Limit and henceneed the slower rendezvous-protocol comms on HPCx. The second is that the Mixed sPPM is sendingless aggregate data around, as it replaces explicit MPI communication calls with direct read/writesto memory. The third advantage arises in the Mixed code’s ability to overlap communications andcomputations, which is denied to the Pure MPI code given the lack of dynamic scheduling.

74

Figure 4.10: Results of 1, 2, 4, and 8 LPAR runs for sPPM with Mixed (top – one process per LPAR and8 threads per process) and MPI (bottom – 8 processes per LPAR) versions. The different MPI processdecompositions used are displayed on the x-axis.

75

4.4.4 Summary

The three ASCI Purple Benchmarks studied all demonstrated very different performance characteristicswhen comparing their Mixed Mode and Pure MPI versions. SMG2000 implements its OpenMP in arather naive manner that results in the thread parallelism occurring at far too fine-grained a level, givingextremely poor performance when compared to a Pure MPI run. UMT2K, on the other hand, employs avery basic form of OpenMP parallelism that is perhaps too simplistic given the complexity of the code,and this results in the Mixed code scaling poorly compared to the Pure MPI. Finally, sPPM uses a cleverOpenMP implementation that allows for a very elegant method of obtaining overlapped communicationand computation, and is the only code studied in the project that gives clearly superior performance withits Mixed Mode version.

This is perhaps an example of sPPM finding the right balance between MPI and OpenMP. SMG2000tries too hard to break down its work across the threads, and this results in the Mixed code spending mostof its time generating unnecessary overhead from the OpenMP library. UMT2K performs a very simpledecomposition over a kernel loop that is described as having variable memory access patterns across theiterations, and hence this use of OpenMP may simply be too basic for the problem in question.

sPPM marries the complexity of both the code and the problem with the kernel’s parallel structure interms of both the OpenMP and the MPI, in such a way as to give better Mixed Mode performance. Thisis a good indication of the increased level of design complexity needed to produce an efficient MixedMode code, and hence the greater amount of work required when actually building one, but if the timeis invested then the Mixed code obtained can outperform a comparable Pure MPI code on a ClusteredSMP system.

76

Chapter 5

Conclusions

This chapter first gives a summary of the main conclusions drawn from the project. A postmortem ofthe project as a whole is then presented, followed by some suggestions for further work.

5.1 Project Summary

Throughout this project, it has typically been the case that code developed under the Mixed ModeMPI+OpenMP model has been outperformed on the available Clustered SMP System by a comparablePure MPI code. Two of the three ASCI Purple Benchmarks, SMG2000 and UMT2K, and practicallyall runs of the various versions of the Jacobi code, all demonstrated this trend. In the case of the Jacobicode, this was despite the Pure OpenMP code actually being considerably faster than the Pure MPIcode for intra-node studies. However, the OpenMP implementations of SMG2000 and UMT2K weredemonstrably slower than their Pure MPI partners, so in the case of these benchmarks it is not verysurprising that the resulting Mixed codes do not perform very well.

Nevertheless, the results are not entirely negative regarding Mixed Mode programming. The third ASCIPurple Benchmark under consideration, sPPM, displayed a performance gain of approximately 35%when moving from a Pure MPI implementation to Mixed MPI+OpenMP. This gain was attributable tothree factors in sPPM: the threaded code was able to overlap its computation and communication acrossthe on-node threads; the code was able to send less aggregate data in Mixed Mode, as explicit com-munication between processors was being replaced by direct reads/writes to memory; and finally, lesscommunication was required in the Mixed Mode code, since there were less MPI processes assigned.

The third reason is essentially generic to all Mixed MPI+OpenMP codes, as one of the principal featuresof this model is the replacement of processes with threads inside SMP nodes. However, less functioncalls from the MPI Library does not necessarily mean that the Mixed Mode’s communication will ac-tually run faster overall, as was the case with the Jacobi code’s Point-to-Point section. This thereforehighlights the two main requirements in a Mixed Mode MPI+OpenMP code in terms of performance:the underlying OpenMP must be an equivalent if not better choice for the computation when comparedto the MPI; and the OpenMP must be able to handle the inter-node communication in as efficient amanner as possible.

77

Here, the SMG2000 and UMT2K benchmarks fail the first of these requirements. The former employsOpenMP at far too fine-grained a level, resulting in practically no speedup for threaded code. The latterutilises a decomposition that is possibly too basic given the underlying memory access patterns of themain kernel, resulting in iteratively poorer speedup for threaded computation compared to un-threaded.Given that the MPI can outperform the OpenMP in these codes at the computation level, which goes onto dominate the overall execution time, communication benefits of the Mixed code are too minimal (ifindeed they are present) to have any effect.

The Jacobi code, on the other hand, fails the second of these requirements. Technically it fails the firstof them as well, but as the hardware analysis indicated this was due to a system-specific Cache Leakeffect being worse with threaded sections compared to unthreaded. Since this effect should not havebeen taking place at all, this deficiency in Mixed Mode cannot be corrected at the software design level.However, regarding the interaction between the threads and the communication calls, we saw that theCollective routines (which none of the ASCI Purple Benchmarks studied made any great use of) ranfaster even with the relatively simple masteronly style employed. This was due to the reduced numberof processes involved in the communication, and also from part of the collective function taking placevia an OpenMP reduction.

In contrast, the Point-to-Point communications were faster with the Pure MPI implementation. Thisresulted from a greater level of communication parallelism in the Pure MPI code, with all processorscommunicating with active processes; in the Mixed Mode code, only the node processors running masterthreads make any MPI calls. In addition, the masteronly style of communication suffers from poor cacheuse during the halo-swapping, as the master thread does not necessarily own the data to be sent and mustinstead spend additional time gathering it from other processors.

Only the sPPM code meets both of the performance requirements for its Mixed Mode version. ThePure OpenMP code shows a marked improvement in performance compared to the Pure MPI whichdemonstrates its computational suitability, (a trait which was also witnessed in the Jacobi code but whichmust have been outweighed by other factors in the Mixed Mode) and the communication implementationhas been designed in such a way as to allow for overlapped functionality. This latter feature means thatthe parallelism of the communications is maintained, as threads which are not involved in MPI calls arebusy getting on with other things.

This is a clear indication that it is often necessary to consider both requirements when designing a MixedMode MPI+OpenMP code, unless the code is seriously dominated by either communications or com-putations. This means that developing a Mixed Mode code that is expected to outperform a comparablePure MPI code will almost always require a much greater investment from the programmer in terms oftime and functionality testing – simply inserting a few OpenMP directives is unlikely to produce an effi-cient Mixed implementation. However, if both parts of the code are given sufficient consideration whendesigning the threaded sections, performance gains are indeed witnessed on a Clustered SMP System.

5.2 Postmortem

Given that this was a 16-week project, the timetable for deliverables was as follows. The first 7 weekswere spent on the development of the main benchmark code; this included all data gathering and perfor-mance studies in addition to actual code construction. One week of this 7 week module was also spentgathering information from the available literature on Mixed Mode programming. The next module, the

78

ASCI Purple Benchmarks, was allocated 4 weeks as this study was conducted at a higher level than thefirst. Finally, 5 weeks were assigned to the production of the written report.

This workplan for this project was essentially adhered to without any problems. In terms of code devel-opment (for the Jacobi code), and familiarisation (for the ASCI Purple codes), no unexpected problemswere encountered. For the former, the simplicity of the underlying Jacobi code made additional paralleldevelopment and debugging quite straightforward. For the latter, the documentation available with thebenchmarks greatly facilitated their ease of use.

The biggest problem encountered with this project was actually in the analysis stage. The volumeof performance data available, both from the code output itself and from the additional system toolsemployed, was far greater than had been originally anticipated. For the timer data, a few days had to bediverted from the Jacobi schedule for the sudden necessity of learning Perl, which was needed to developsome home-grown analysis software; the resulting time saved in assembling the timer data more thanmade up for this unexpected side-track. However, the system tools’ data was almost always analysedby hand, which in practical terms meant poring over reams and reams of printouts. This was clearly aless than ideal approach, but given the schedule constraints of the project it was difficult to dedicate anymore time to the development of the extra analysis software needed.

A second problem was with usage of the HPCx Service. Whilst the Jacobi code was subject to fairlyextensive testing, the ASCI Purple analysis was based on much less data particularly with regards togathering repeated results of runs in order to ensure reproducibility of performance times. This cameabout due to an unexpected reduction of account hours that turned out to be not as restrictive as was atfirst thought, but by the time everything regarding this had settled there was no longer sufficient timeleft in the 4-week ASCI Purple allocation to make up for the shortfall of code runs.

Were this project to be repeated with hindsight, the only significant change regarding the workplanwould be a dedicated assignment of time to the development of data analysis software in Perl. Thiswould have required a basic familiarisation period with all of the various system tools used in order toget used to the data formats, but this would not have been a bad thing to have done in any case. With amore extensive array of software to simplify the analysis process, not only would the existing series oftests have proceeded quicker but there is the possibility that even more performance analysis could havebeen conducted. In addition, better use could have been made of the available HPCx account time afterthe restrictions were imposed, which would have allowed for more reliable ASCI Purple performancedata to be assembled.

5.3 Future Work

For the Jacobi code, the most interesting future development would be an alteration to the OpenMP im-plementation, with the intention of improving the efficiency behind the interaction between the threadsand the MPI library. This would require a change in the fundamental design style behind the MixedMode code, as masteronly would have to be abandoned in favour of individually communicating threads.However, such a code may give superior performance to the Pure MPI version for the reasons outlinedearlier.

Furthermore, this highlights another area of development work that could be pursued: an investigation ofnon-masteronly Mixed codes in general. Much of the existing literature on Mixed Mode Programming

79

revolves around the masteronly style, as it is often the easiest to implement given a functional PureMPI code. However, as we have seen, this form Mixed MPI+OpenMP code is often unable to providesufficiently good performance, which makes a study of more sophisticated design methods all the moreinteresting. In addition to the possibility of fully communicating threads, other styles such as equal-levelMPI and OpenMP could also be examined.

This would not answer a key question of Mixed Mode Programming: whether the gains in performancethat can be obtained are sufficient to offset the extra time needed for code development. The only wayto really explore this would be to perform a design study of a code parallelised in both MPI and MixedMPI+OpenMP, and compare the total time spent in development against the performance gain obtained.Unfortunately this is very hard to do in practice, as applying standardised metrics to the developmentprocess is a somewhat inexact science.

For the ASCI Purple codes, the most immediate requirement would be to gather more runtime data,which could then be used to improve the analysis of the three benchmarks tested. Another use of thesuite would be to test some of the other benchmarks on the HPCx Service, to see how the various MixedMode implementations faired against their Pure MPI equivalents. This again opens another more generalarea of future work, which is simply to test more Mixed Mode codes, as a larger pool of results wouldhelp build a better picture of Mixed performance on clustered systems. Codes could be taken from otherestablished benchmark suites in addition to ASCI Purple, and this could grant access to different formsof Mixed Mode in addition to MPI+OpenMP – for example, MPI and POSIX threads, or some of thevariants on masteronly referred to above.

One final line of work that could be pursued is a more thorough investigation of the interaction betweenMPI and OpenMP, with the intention of developing a better understanding of the relationship betweenthese two parallel models at the software implementation level. A more complete study would makeanalysing the performance characteristics of Mixed MPI+OpenMP codes easier, and could help simplifythe design process if such interactions could be anticipated.

80

Appendix A

Tabulated Data

A.1 The Jacobi Code

This section contains all of the timer output data from the Jacobi code referred to in Section 3.4. Thisdata is not the raw program output; instead it gives only the average values for each code section perrun. All times in this section are in seconds.

A.1.1 Fixed Problem Size

Pure Codes

I J Algorithm Delta Update Total1 7 23.60 ± 0.20 15.42 ± 0.29 11.73 ± 0.23 50.80 ± 0.727 1 14.94 ± 0.03 8.00 ± 0.06 6.58 ± 0.04 29.55 ± 0.121 8 22.34 ± 0.08 14.57 ± 0.08 11.50 ± 0.06 48.46 ± 0.232 4 17.75 ± 0.13 11.42 ± 0.12 8.64 ± 0.10 37.85 ± 0.344 2 15.12 ± 0.18 9.01 ± 0.20 7.04 ± 0.20 31.20 ± 0.568 1 13.70 ± 0.05 7.58 ± 0.16 6.04 ± 0.12 27.36 ± 0.31

Table A.1: Fixed Problem; OpenMP; 1 LPAR

I J P-t-P Algorithm Delta Collective Update Total1 7 1.52 ± 0.14 17.30 ± 0.08 9.03 ± 0.15 1.13 ± 0.24 5.97 ± 0.18 35.15 ± 0.177 1 0.71 ± 0.05 17.17 ± 0.04 8.56 ± 0.06 1.04 ± 0.11 5.57 ± 0.06 33.26 ± 0.031 8 1.40 ± 0.07 15.41 ± 0.09 8.21 ± 0.17 1.21 ± 0.11 5.16 ± 0.16 31.64 ± 0.442 4 1.11 ± 0.07 15.24 ± 0.05 8.11 ± 0.08 1.12 ± 0.10 5.01 ± 0.09 30.84 ± 0.194 2 0.96 ± 0.09 15.24 ± 0.12 7.99 ± 0.21 1.31 ± 0.36 4.93 ± 0.20 30.68 ± 0.798 1 0.62 ± 0.07 15.27 ± 0.13 7.97 ± 0.26 1.17 ± 0.15 5.02 ± 0.29 30.29 ± 0.75

Table A.2: Fixed Problem; MPI; 1 LPAR

81

Mixed vs. MPI

I J P-t-P Algorithm Delta Collective Update Total1 4 1.79 ± 0.02 4.88 ± 0.01 1.93 ± 0.005 1.25 ± 0.03 1.57 ± 0.005 11.58 ± 0.052 2 1.93 ± 0.04 4.91 ± 0.01 1.87 ± 0.01 1.13 ± 0.03 1.57 ± 0.003 11.58 ± 0.074 1 1.49 ± 0.04 4.89 ± 0.01 1.89 ± 0.005 1.25 ± 0.05 1.59 ± 0.01 11.26 ± 0.10

Table A.3: Fixed Problem; Mixed; 4 LPARs


32 1 0.62 ± 0.02 3.56 ± 0.01 1.54 ± 0.01 2.61 ± 0.06 0.91 ± 0.01 9.52 ± 0.06

Table A.4: Fixed Problem; MPI; 4 LPARs

I J P-t-P Algorithm Delta Collective Update Total1 8 1.59 ± 0.04 2.66 ± 0.01 1.10 ± 0.002 1.25 ± 0.07 0.96 ± 0.01 7.72 ± 0.112 4 1.72 ± 0.06 2.64 ± 0.01 1.06 ± 0.003 1.27 ± 0.08 0.96 ± 0.01 7.81 ± 0.154 2 1.77 ± 0.19 2.72 ± 0.02 1.09 ± 0.01 1.34 ± 0.24 0.98 ± 0.02 8.06 ± 0.458 1 1.30 ± 0.08 2.59 ± 0.13 1.09 ± 0.06 1.22 ± 0.10 0.95 ± 0.05 7.31 ± 0.38

Table A.5: Fixed Problem; Mixed; 8 LPARs


16 4 1.26 ± 0.09 5.29 ± 0.01 0.88 ± 0.002 3.69 ± 0.53 0.50 ± 0.003 11.87 ± 0.6664 1 0.62 ± 0.04 1.79 ± 0.02 0.85 ± 0.01 3.09 ± 0.48 0.63 ± 0.01 7.26 ± 0.54

Table A.6: Fixed Problem; MPI; 8 LPARs


32 1 0.61 ± 0.01 4.35 ± 0.01 1.61 ± 0.01 2.27 ± 0.04 1.23 ± 0.01 10.33 ± 0.04

Table A.7: Fixed Problem; Mixed with 1 thread per process; 4 LPARs

82

Pure OpenMP Studies

Threads Algorithm Delta Update Total7 15.54 ± 0.26 6.93 ± 0.07 5.45 ± 0.06 27.96 ± 0.278 14.58 ± 0.18 7.12 ± 0.20 5.29 ± 0.10 27.02 ± 0.39

Table A.8: Fixed Problem; OpenMP version 1; 1 LPAR

Threads Algorithm Delta Update Total7 13.40 ± 0.10 8.64 ± 0.17 5.75 ± 0.04 28.62 ± 0.258 11.77 ± 0.14 7.13 ± 0.07 5.43 ± 0.05 25.37 ± 0.13






83

Improved Mixed vs. MPI


Table A.12: Fixed Problem; Mixed version 2; 4 LPARs


Table A.13: Fixed Problem; Mixed version 1; 4 LPARs


32 1 0.61 ± 0.01 3.53 ± 0.01 1.61 ± 0.01 2.36 ± 0.05 1.17 ± 0.004 9.56 ± 0.05

Table A.14: Fixed Problem; Mixed version 2 with 1 thread per process; 4 LPARs

84

A.1.2 L3 Scaling Problem Size

Small

Threads Algorithm Delta Update Total8 9.923 ± 0.048 5.660 ± 0.122 4.944 ± 0.091 20.911 ± 0.249

Table A.15: L3 Cache fit with Collectives on; OpenMP; 1 LPAR


Table A.16: L3 Cache fit with Collectives off; OpenMP; 1 LPAR

I J P-t-P Algorithm Delta Collective Update Total1 1 0.038 ± 0.000 12.371 ± 0.029 6.962 ± 0.031 0.100 ± 0.00 1 4.939 ± 0.038 24.513 ± 0.079

Table A.17: L3 Cache fit with Collectives on; Mixed; 1 LPAR

I J P-t-P Algorithm Delta Collective Update Total1 1 0.040 ± 0.000 12.479 ± 0.031 0.000 ± 0.000 0.000 ± 0.00 0 6.236 ± 0.033 18.839 ± 0.066

Table A.18: L3 Cache fit with Collectives off; Mixed; 1 LPAR


Table A.19: L3 Cache fit with Collectives on; MPI; 1 LPAR


Table A.20: L3 Cache fit with Collectives off; MPI; 1LPAR

85

Medium

I J P-t-P Algorithm Delta Collective Update Total1 2 3.868 ± 0.105 12.812 ± 0.069 6.791 ± 0.074 0.533 ± 0.099 4.902 ± 0.075 29.008 ± 0.1982 1 0.791 ± 0.047 12.646 ± 0.055 6.961 ± 0.055 0.540 ± 0.069 5.001 ± 0.052 26.041 ± 0.057

Table A.21: L3 Cache fit with Collectives on; Mixed; 2 LPARs


Table A.22: L3 Cache fit with Collectives off; Mixed; 2 LPARs


16 1 0.546 ± 0.055 12.601 ± 0.043 6.675 ± 0.052 1.503 ± 0.124 4.533 ± 0.062 25.974 ± 0.047

Table A.23: L3 Cache fit with Collectives on; MPI; 2 LPARs


16 1 1.558 ± 0.196 12.640 ± 0.051 0.000 ± 0.000 0.000 ± 0.000 5.069 ± 0.254 19.317 ± 0.575

Table A.24: L3 Cache fit with Collectives off; MPI; 2 LPARs





86


32 1 0.659 ± 0.074 12.908 ± 0.065 6.933 ± 0.057 2.328 ± 0.172 4.846 ± 0.086 27.785 ± 0.089



32 1 2.206 ± 0.246 12.941 ± 0.089 0.000 ± 0.000 0.000 ± 0.000 5.400 ± 0.168 20.598 ± 0.030


Large

Note that the 16 LPAR results in this section were originally run for only 2000 iterations. The data pre-sented in the tables has been scaled up by a factor of 2.5, in order to make the 16 LPAR data comparablewith all the other runs of the L3 Scaling Problem Size.






16 4 1.326 ± 0.116 12.850 ± 0.139 7.172 ± 0.122 3.332 ± 0.257 4.716 ± 0.146 29.508 ± 0.39764 1 0.928 ± 0.085 12.588 ± 0.055 6.554 ± 0.045 3.997 ± 1.345 4.535 ± 0.106 28.716 ± 1.547


87


16 4 3.469 ± 1.166 13.124 ± 0.098 0.000 ± 0.000 0.000 ± 0.000 5.859 ± 0.217 22.509 ± 1.22764 1 3.552 ± 0.884 12.807 ± 0.079 0.000 ± 0.000 0.000 ± 0.000 5.301 ± 0.132 21.717 ± 0.844



16 1 1.885 ± 0.280 12.540 ± 0.053 6.923 ± 0.098 3.448 ± 0.558 4.775 ± 0.113 29.675 ± 0.618



16 1 2.900 ± 0.323 12.738 ± 0.055 0.000 ± 0.000 0.000 ± 0.000 6.145 ± 0.108 21.858 ± 0.310



128 1 0.923 ± 0.088 12.468 ± 0.120 6.463 ± 0.100 5.403 ± 0.755 4.325 ± 0.108 29.698 ± 0.915



128 1 5.703 ± 4.400 12.475 ± 0.180 0.000 ± 0.000 0.000 ± 0.000 4.605 ± 0.433 22.955 ± 3.935


Point-to-Point Communication Study

I J P-t-P Algorithm Delta Collective Update Total16 1 0.678 ± 0.198 12.486 ± 2.265 6.882 ± 1.605 2.055 ± 0.580 4.966 ± 1.053 27.169 ± 5.654

Table A.37: L3 Cache fit with Collectives on; Mixed on 4 LPARs with 2 threads

88


Table A.38: L3 Cache fit with Collectives on; Mixed on 4 LPARs with 4 threads

A.1.3 L2 Scaling Problem Size

Small


Table A.39: L2 Cache fit with Collectives on; OpenMP; 1 LPAR


Table A.40: L2 Cache fit with Collectives off; OpenMP; 1 LPAR


Table A.41: L2 Cache fit with Collectives on; Mixed; 1 LPAR


Table A.42: L2 Cache fit with Collectives off; Mixed; 1 LPAR


Table A.43: L2 Cache fit with Collectives on; MPI; 1 LPAR


Table A.44: L2 Cache fit with Collectives off; MPI; 1 LPAR

89

Medium






16 1 0.563 ± 0.028 6.068 ± 0.164 2.672 ± 0.081 1.661 ± 0.114 1.540 ± 0.048 12.790 ± 0.383



16 1 1.541 ± 0.050 6.056 ± 0.164 0.000 ± 0.000 0.000 ± 0.000 1.577 ± 0.060 9.272 ± 0.269






90


16 4 1.454 ± 0.053 6.082 ± 0.016 2.684 ± 0.021 3.308 ± 0.173 1.590 ± 0.028 15.378 ± 0.26932 1 0.635 ± 0.036 6.073 ± 0.017 2.725 ± 0.026 2.774 ± 0.410 1.549 ± 0.023 14.013 ± 0.437



32 1 1.845 ± 0.361 6.056 ± 0.090 0.000 ± 0.000 0.000 ± 0.000 1.549 ± 0.037 9.547 ± 0.442


Large

Note that the L2 Scaling Problem Size was not run on 16 LPARs.






16 4 1.454 ± 0.053 6.082 ± 0.016 2.684 ± 0.021 3.308 ± 0.173 1.590 ± 0.028 15.378 ± 0.26964 1 0.887 ± 0.062 6.092 ± 0.092 2.660 ± 0.054 4.290 ± 0.631 1.554 ± 0.037 15.749 ± 0.790


91


16 4 2.484 ± 0.249 6.080 ± 0.017 0.000 ± 0.000 0.000 ± 0.000 1.663 ± 0.032 10.347 ± 0.30164 1 3.097 ± 0.725 6.092 ± 0.055 0.000 ± 0.000 0.000 ± 0.000 1.608 ± 0.039 10.902 ± 0.757


A.2 ASCI Purple Benchmarks

This section contains all of the timer data referred to in Section 4.4. This data is raw timer outputfrom the benchmark codes, as insufficient runs were performed with which to gather average values.SMG2000 is not covered here, as all data gathered for that code has been included in Section 4.4.1 intabulated form already.

A.2.1 UMT2K

Parallel Mode MPI Processes Threads per Process Wallclock (min) Angle-Loop-Only (min)MPI 32 1 1.265 1.053

Mixed 4 8 1.492 1.130

Table A.57: Initial 4 LPAR run for MPI and Mixed

Parallel Mode MPI Processes Threads per Process Wallclock (min) Angle-Loop-Only (min)OpenMP 1 8 1.402 1.212

Mixed 1 8 1.429 1.227Mixed 2 4 1.351 1.162MPI 8 1 1.236 1.064

Table A.58: 1 LPAR run for OpenMP, Mixed and MPI

Parallel Mode MPI Processes Threads per Process Wallclock (min) Angle-Loop-Only (min)MPI 1 1 1.999 1.853MPI 2 1 1.030 0.942MPI 4 1 0.666 0.580MPI 8 1 0.369 0.322

Table A.59: 1 LPAR run for MPI with reduced tmax

Parallel Mode MPI Processes Threads per Process Wallclock (min) Angle-Loop-Only (min)Mixed 1 1 2.037 1.888Mixed 1 2 1.102 1.006Mixed 1 4 0.607 0.533Mixed 1 8 0.380 0.315

Table A.60: 1 LPAR run for Mixed with reduced tmax

92

Parallel Mode MPI Processes Threads per Process Wallclock (min) Angle-Loop-Only (min)OpenMP 1 1 2.003 1.861OpenMP 1 2 1.070 0.979OpenMP 1 4 0.612 0.539OpenMP 1 8 0.378 0.312

Table A.61: 1 LPAR run for OpenMP with reduced tmax

Parallel Mode MPI Processes Threads per Process Wallclock (min) Angle-Loop-Only (min)MPI 16 1 2.560 2.339

Mixed 2 8 2.890 2.481MPI 32 1 1.271 1.055

Mixed 4 8 1.492 1.152MPI 64 1 0.663 0.537

Mixed 8 8 0.786 0.548

Table A.62: Final run on 2, 4, and 8 LPARs for MPI and Mixed

A.2.2 sPPM

Parallel Mode x y z Wallclock (s)Mixed 4 1 1 7.315Mixed 1 4 1 7.311Mixed 1 1 4 7.217Mixed 2 2 1 7.459Mixed 2 1 2 7.461Mixed 1 2 2 7.309

Table A.63: Mixed Mode run on 4 LPARs with varying process decompositions; 1 process per LPARand 8 threads per process

Parallel Mode x y z Wallclock (s)MPI 32 1 1 8.516MPI 1 32 1 8.823MPI 1 1 32 9.077MPI 4 4 2 10.784MPI 4 2 4 10.063MPI 2 4 4 10.979

Table A.64: Pure MPI run on 4 LPARs with varying process decompositions; 8 processes per LPAR

Parallel Mode x y z Wallclock (s)OpenMP 1 1 1 16.465

Mixed 1 1 1 16.601MPI 2 2 2 23.989

Table A.65: 1 LPAR run for OpenMP (8 threads), Mixed (1 process, 8 threads per process), and MPI (8processes)

93

Parallel Mode x y z Wallclock (s)Mixed 1 1 1 16.601Mixed 2 1 1 8.824Mixed 2 2 1 4.471Mixed 2 2 2 2.269

Table A.66: Mixed Mode run on 1, 2, 4, and 8 LPARs, with 1 process per LPAR and 8 threads perprocess

Parallel Mode x y z Wallclock (s)MPI 2 2 2 23.989MPI 4 2 2 12.016MPI 4 4 2 6.122MPI 4 4 4 3.032

Table A.67: Pure MPI run on 1, 2, 4, and 8 LPARs, with 8 processes per LPAR

94

Bibliography

[1] L. Smith and M. Bull; Development of Mixed Mode MPI/OpenMP Applications; Scientific Pro-gramming, Vol.9, No. 2-3, 83-98; 2001.http://www.epcc.ed.ac.uk/ markb/docs/sci_prog01.pdf

[2] J. Hein and M. Bull; Capability Computing: Achieving Scalability on over 1000 Processors;HPCx Technical Report HPCxTR0301; 2003.http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0301.pdf

[3] D. Klepacki; Mixed-Mode Programming; IBM ACTC Workshop for Power3 presentation, SPSciComp ’99; 1999.http://www.mhpcc.edu/doc/ACTC/MpiOpenMP.pdf

[4] D. Salmond; ECMWF’s IFS on an IBM p690 system with 960 Power4 processors; ECMWFpresentation at Third UKHEC Annual Seminar; 2002.http://www.ukhec.ac.uk/events/annual2002/salmond.pdf

[5] E Chow and D. Hysom; Assessing Performance of Hybrid MPI/OpenMP Programs on SMPClusters; Lawrence Livermore National Laboratory technical report UCRL-JC-143957; 2001.http://www.llnl.gov/CASC/people/chow/pubs/hpaper.ps

[6] L. Giraud; Combining Shared and Distributed Memory Programming Models on Clusters ofSymmetric Multiprocessors: Some Basic Promising Experiments; The International Journal ofHigh Performance Computing Applications, Volume 16, No. 4, pp 425-430; 2002.http://www.paulchapmanpublishing.co.uk/journals/details/issue/sample/a031089.pdf

[7] F. Cappello and D. Etiemble; MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks;in the proceedings of SC2000, Dallas; 2000.http://www.sc2000.org/techpapr/papers/pap.pap214.pdf

[8] Y. He and C.H.Q. Ding; An Evaluation of MPI and OpenMP Paradigms for Multi-DimensionalData Remapping; in the proceedings of WOMPAT 2003, LNCS 2716, pp 195-210; 2003.

[9] S. Bova, C. Breshears, R. Eigenmann, H. Gabb, G. Gaertner, B. Kuhn, B. Margo, S. Salvini,and V. Vatsa; Combining Message-passing and Directives in Parallel Applications; SIAM News,Volume 32, Number 9; 1999.http://www.siam.org/siamnews/11-99/mpi.pdf

95

[10] R. Rabenseifner; Hybrid Parallel Programming: Performance Problems and Chances; in theproceedings of 45th CUG Conference, Columbus; 2003.http://www.hlrs.de/people/rabenseifner/publ/cug2003_hybrid_1.pdf

[11] R. Rabenseifner and G. Wellein; Communication and Optimization Aspects of Par-allel Programming Models on Hybrid Architectures; The International Journal ofHigh Performance Computing Applications, Volume 17, No. 1, pp 49-62; 2003.http://www.hlrs.de/people/rabenseifner/publ/hybrid_HPCA_3.pdf

[12] R. Rabenseifner; Communication Bandwidth of Parallel Programming Models on HybridArchitectures; in the proceedings of WOMPEI 2002, part of ISHPC-IV, Kyoto; 2002.http://www.hlrs.de/people/rabenseifner/publ/hybrid_wompei2002final.pdf

[13] A. Jackson; Mixed-Mode Parallelisation of a Discrete Element Model; MSc in HPC Dissertation,University of Edinburgh, UK; 2002.

[14] D.S. Henty; Performance of Hybrid Message-Passing and Shared-Memory Parallelism for Dis-crete Element Modeling; in the proceedings of SC2000, Dallas; 2000.http://www.sc2000.org/techpapr/papers/pap.pap154.pdf

[15] J.S. Vetter and A. Yoo; An Empirical Performance Evaluation of Scalable Scientific Applications;in the proceedings of SC2002, Baltimore; 2002.http://sc-2002.org/paperpdfs/pap.pap222.pdf

[16] D.H. Ahn and J.S. Vetter; Scalable Analysis Techniques for Microprocessor Performance CounterMetrics; in the proceedings of SC2002, Baltimore; 2002.http://sc-2002.org/paperpdfs/pap.pap257.pdf

[17] P.R. Woodward, S.E. Anderson, D.H. Porter, D. Dinge, I. Sytine, T. Ruwart, M. Jacobs, R.H.Cohen, B.C. Curtis, W.P. Dannevik, A.M. Dimits, D.E. Eliason, A.A. Mirin, K. Winkler, and S.Hodson; Exploiting the Power of DSM and SMP Clusters for Parallel CFD; in the proceedingsof International Parallel CFD 1999 Conference, Williamsburg; 1999.http://kingfish.coastal.edu/physics/ddinge/parcfd99.pdf

[18] M. Stewart; Single Node Optimization Techniques on the NERSC SP; NERSC User ServicesGroup presentation; 2002.http://hpcf.nersc.gov/computers/SP/nodopt.html

[19] J.S. Vetter; Dynamic Statistical Profiling of Communications Activity in Distributed Applications;in the proceedings of SIGMETRICS: Joint International Conference on Measurement and Mod-eling of Computer Systems; 2002.http://www.llnl.gov/CASC/people/vetter/pubs/sigmetrics02.pdf

[20] Top 500 Supercomputer Sites;http://www.top500.org/

[21] HPCx Capability Computing;http://www.hpcx.ac.uk

96

[22] ASCI White;http://www.llnl.gov/asci/platforms/white

[23] ASCI Purple;http://www.llnl.gov/asci/purple

[24] ASCI Q;http://www.llnl.gov/asci/platforms/lanl_q/

[25] MPI: A Message-Passing Interface Standard; Message Passing Interface Forum, June 1995;http://www.mpi-forum.org

[26] OpenMP C and C++ Application Program Interface; OpenMP ARB, March 2002;http://www.openmp.org

[27] Hardware Performance Monitor (HPM) Toolkit;http://www.hpcx.ac.uk/support/documentation/IBMdocuments/HPM.html

[28] MPI Trace Tools;http://www.hpcx.ac.uk/support/documentation/IBMdocuments/mpitrace

97

mixed mode programming on a clustered smp...

Documents