

1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2446479, IEEE Transactions on Parallel and Distributed Systems

MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Ashwin M. Aji, Lokendra S. Panwar, Feng Ji, Karthik Murthy, Milind Chabbi, Pavan Balaji, Keith R. Bisset, James Dinan, Wu-chun Feng, John Mellor-Crummey, Xiaosong Ma, and Rajeev Thakur

• A. M. Aji, L. S. Panwar, and W. Feng are with the Dept. of Comp. Sci., and K. Bisset is with the Virginia Bioinformatics Inst., Virginia Tech. E-mail: {aaji, lokendra, feng}@cs.vt.edu, [email protected]
• P. Balaji, J. Dinan, and R. Thakur are with the Math. and Comp. Sci. Div., Argonne National Lab. E-mail: {balaji, dinan, thakur}@mcs.anl.gov
• K. Murthy, M. Chabbi, and J. Mellor-Crummey are with the Dept. of Comp. Sci., Rice Univ. E-mail: {ksm2, mc29, johnmc}@rice.edu
• F. Ji and X. Ma are with the Dept. of Comp. Sci., N. C. State Univ. E-mail: [email protected], [email protected]

Abstract—Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC’s runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimum contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.

Index Terms—Heterogeneous (hybrid) systems, Parallel systems, Distributed architectures, Concurrent programming


1 INTRODUCTION

Graphics processing units (GPUs) have gained widespread use as general-purpose computational accelerators and have been studied extensively across a broad range of scientific applications [1], [2], [3]. The presence of general-purpose accelerators in high-performance computing (HPC) clusters has also steadily increased, and 15% of today’s top 500 fastest supercomputers (as of November 2014) employ general-purpose accelerators [4].

Nevertheless, despite the growing prominence of accelerators in HPC, data movement on systems with GPU accelerators remains a significant problem. Hybrid programming with the Message Passing Interface (MPI) [5] and the Compute Unified Device Architecture (CUDA) [6] or Open Computing Language (OpenCL) [7] is the dominant means of utilizing GPU clusters; however, data movement between processes is currently limited to data residing in the host memory. The ability to interact with auxiliary memory systems, such as GPU memory, has not been integrated into such data movement standards, thus leaving applications with no direct mechanism to perform end-to-end data movement. Currently, transmission of data from accelerator memory must be done by explicitly copying data to host memory before performing any communication operations. This process impacts productivity and can lead to a severe loss in performance. Significant programmer effort would be required to recover this performance through vendor- and system-specific optimizations, including GPUDirect [8] and node and I/O topology awareness.

We introduce MPI-ACC, an integrated and extensible framework that provides end-to-end data movement in accelerator-based clusters. MPI-ACC significantly improves productivity by providing a unified programming interface, compatible with both CUDA and OpenCL, that can allow end-to-end data movement irrespective of whether data resides in host or accelerator memory. In addition, MPI-ACC allows applications to easily and portably leverage vendor- and platform-specific capabilities in order to optimize data movement performance. Our specific contributions in this paper are as follows.

• An extensible interface for integrating auxiliary memory systems (e.g., GPU memory) with MPI

• An efficient runtime system, which is heavily optimized for a variety of vendors and platforms (CUDA and OpenCL) and carefully designed to minimize contention with existing workloads

• An in-depth study of high-performance simulation codes from two scientific application domains (computational epidemiology [9], [10] and seismology modeling [11])

We evaluate our findings on HokieSpeed, a state-of-the-art hybrid CPU-GPU cluster housed at Virginia Tech. Microbenchmark results indicate that MPI-ACC can provide up to 48% improvement in two-sided GPU-to-GPU communication latency. We show that MPI-ACC’s design does not oversubscribe the GPU, thereby minimizing contention with other concurrent GPU workloads. We demonstrate how MPI-ACC can be used in epidemiology and seismology modeling applications to easily explore and evaluate new optimizations at the application level. In particular, we overlap MPI-ACC CPU-GPU communication calls with computation on the CPU as well as the GPU, thus resulting in better overall cluster utilization. Results indicate that the MPI-ACC–driven communication-computation patterns can help improve the performance of the epidemiology simulation by up to 13.3% and the seismology modeling application by up to 44% over the traditional hybrid MPI+GPU models. Moreover, MPI-ACC decouples the low-level memory optimizations from the applications, thereby making them scalable and portable across several architecture generations. MPI-ACC enables the programmer to seamlessly choose between CPU, GPU, or any accelerator device as the communication target, thus enhancing programmer productivity.

This paper is organized as follows. Section 2 introduces the current MPI and GPU programming models and describes the current hybrid application programming approaches for CPU-GPU clusters. We discuss related work in Section 3. In Section 4, we present MPI-ACC’s design and its optimized runtime system. Section 5 explains the execution profiles of the epidemiology and seismology modeling applications, their inefficient default MPI+GPU designs, and the way GPU-integrated MPI can be used to optimize their performance while improving productivity. In Section 6, we evaluate the communication and application-level performance of MPI-ACC. Section 7 evaluates the contention impact of MPI-ACC on concurrent GPU workloads. Section 8 summarizes our conclusions.

2 MOTIVATION

In this section, we describe the issues in the traditional CPU-GPU application design and illustrate how GPU-integrated MPI can help alleviate them.

2.1 Designing MPI+GPU Applications

Graphics processing units have become more amenable to general-purpose computations over the past few years, largely as a result of the more programmable GPU hardware and increasingly mature GPU programming models, such as CUDA [6] and OpenCL [7]. Today’s discrete GPUs reside on PCIe and are equipped with very high-throughput GDDR5 device memory on the GPU cards. To fully utilize the benefits of the ultra-fast memory subsystem, however, current GPU programmers must explicitly transfer data between the main memory and the device memory across PCIe, by issuing direct memory access (DMA) calls such as cudaMemcpy or clEnqueueWriteBuffer.

computation_on_GPU(gpu_buf);
cudaMemcpy(host_buf, gpu_buf, size, D2H ...);
MPI_Send(host_buf, size, ...);

(a) Basic hybrid MPI+GPU with synchronous execution – high productivity and low performance.

int processed[chunks] = {0};
for (j = 0; j < chunks; j++) {
    computation_on_GPU(gpu_buf+offset, streams[j]);
    cudaMemcpyAsync(host_buf+offset, gpu_buf+offset,
                    D2H, streams[j], ...);
}
numProcessed = 0; j = 0; flag = 1;
while (numProcessed < chunks) {
    if (cudaStreamQuery(streams[j]) == cudaSuccess) {
        MPI_Isend(host_buf+offset, ...);   /* start MPI */
        numProcessed++;
        processed[j] = 1;
    }
    MPI_Testany(...);                      /* check progress */
    if (numProcessed < chunks)             /* find next chunk */
        while (flag) {
            j = (j+1) % chunks; flag = processed[j];
        }
}
MPI_Waitall();

(b) Advanced hybrid MPI+GPU with pipelined execution – low productivity and high performance.

for (j = 0; j < chunks; j++) {
    computation_on_GPU(gpu_buf+offset, streams[j]);
    MPI_Isend(gpu_buf+offset, ...);
}
MPI_Waitall();

(c) GPU-integrated MPI with pipelined execution – high productivity and high performance.

Fig. 1: Designing hybrid CPU-GPU applications. For the manual MPI+GPU model with OpenCL, clEnqueueReadBuffer and clEnqueueWriteBuffer would be used in place of cudaMemcpy. For MPI-ACC, the code remains the same for all platforms (CUDA or OpenCL) and supported devices.

The Message Passing Interface (MPI) is one of the most widely adopted parallel programming models for developing scalable, distributed-memory applications. MPI-based applications are typically designed by identifying parallel tasks and assigning them to multiple processes. In the default hybrid MPI+GPU programming model, the compute-intensive portions of each process are offloaded to the local GPU. Data is transferred between processes by explicit messages in MPI. However, the current MPI standard assumes a CPU-centric single-memory model for communication. The default MPI+GPU programming model therefore employs a hybrid two-staged data movement model, where data copies are performed between main memory and the local GPU’s device memory and are preceded and/or followed by MPI communication between the host CPUs (Figures 1a and 1b). This is the norm seen in most GPU-accelerated MPI applications today [10], [12], [13]. The basic approach (Figure 1a) has less complex code, but the blocking and staged data movement severely reduce performance because of the inefficient utilization of the communication channels. On the other hand, overlapped communication via pipelining efficiently utilizes all the communication channels but requires significant programmer effort, in other words, low productivity.

Moreover, this approach leads to tight coupling between the high-level application logic and low-level data movement optimizations; hence, the application developer has to maintain several code variants for different GPU architectures and vendors. In addition, construction of such a sophisticated data movement scheme above the MPI runtime system incurs repeated protocol overheads and eliminates opportunities for low-level optimizations. Moreover, users who need high performance are faced with the complexity of leveraging a multitude of platform-specific optimizations that continue to evolve with the underlying technology (e.g., GPUDirect [8]).

2.2 Application Design Using GPU-Integrated MPI Frameworks

To bridge the gap between the disjointed MPI and GPU programming models, researchers have recently developed GPU-integrated MPI solutions such as our MPI-ACC [14] framework and MVAPICH-GPU [15] by Wang et al. These frameworks provide a unified MPI data transmission interface for both host and GPU memories; in other words, the programmer can use either the CPU buffer or the GPU buffer directly as the communication parameter in MPI routines. The goal of such GPU-integrated MPI platforms is to decouple the complex, low-level, GPU-specific data movement optimizations from the application logic, thus providing the following benefits: (1) portability – the application can be more portable across multiple accelerator platforms; and (2) forward compatibility – with the same code, the application can automatically achieve performance improvements from new GPU technologies (e.g., GPUDirect RDMA) if applicable and supported by the MPI implementation. In addition to enhanced programmability, transparent architecture-specific and vendor-specific performance optimizations can be provided within the MPI layer.

Using GPU-integrated MPI, programmers need only to write GPU kernels and regular host CPU code for computation and invoke the standard MPI functions for CPU-GPU data communication, without worrying about the complex data movement optimizations of the diverse accelerator technologies (Figure 1c).

3 RELATED WORK

MVAPICH [16] is an implementation of MPI based on MPICH [17] and is optimized for RDMA networks such as InfiniBand. From v1.8 onward, MVAPICH has included support for transferring CUDA memory regions across the network (point-to-point, collective, and one-sided communications), but its design relies heavily on CUDA’s Unified Virtual Addressing (UVA) feature. On the other hand, MPI-ACC takes a more portable approach: we support data transfers among CUDA, OpenCL, and CPU memory regions, and our design is independent of library version or device family. By including OpenCL support in MPI-ACC, we automatically enable data movement between a variety of devices, including GPUs from NVIDIA and AMD, IBM and Intel CPUs, Intel MICs, AMD Fusion, and IBM’s Cell Broadband Engine. Also, we make no assumptions about the availability of key hardware features (e.g., UVA) in our interface design, thus making MPI-ACC a truly generic framework for heterogeneous CPU-GPU systems.

CudaMPI [18] and GMH [19] are new libraries that are designed to improve programmer productivity when managing data and compute among GPUs across the network. Both these approaches are host-centric programming models and provide new programming abstractions on top of existing MPI and CUDA calls. CaravelaMPI [20] is another MPI-style library solution for data management across GPUs, but the solution is limited to the custom Caravela API and not applicable to general MPI programs. In contrast, MPI-ACC completely conforms to the MPI standard itself, and our implementation removes the overhead of communication setup time while maintaining productivity.

DCGN [21] is a device-centric programming model that moves away from the GPU-as-a-worker programming model. DCGN assigns ranks to GPU threads in the system and allows them to communicate among each other by using MPI-like library calls, and a CPU-based polling runtime handles GPU control and data transfers. MPI-ACC is orthogonal to DCGN in that we retain the host-centric MPI communication and execution model while hiding the details of third-party CPU-GPU communication libraries from the end user.

Partitioned Global Address Space (PGAS) models, such as Chapel, Global Arrays, and OpenSHMEM, provide a globally shared memory abstraction to distributed-memory systems. Researchers have explored extending PGAS models to include GPUs as part of the shared-memory abstraction [22], [23]. PGAS models can use MPI itself as the underlying runtime layer [24], [25] and can be considered as complementary efforts to MPI.

4 MPI-ACC: DESIGN AND OPTIMIZATIONS

In this section, we describe the design, implementation, and optimizations of MPI-ACC, the first interface that integrates CUDA, OpenCL, and other models within an MPI-compliant interface.

4.1 API Design

We discuss MPI-ACC’s API design considerations and compare the tradeoffs of our solution with the API design of MVAPICH. In an MPI communication call, the user passes a void pointer that indicates the location of the data on which the user wishes to operate. To the MPI library, a pointer to host memory is indistinguishable from a pointer to GPU memory. The MPI implementation needs a mechanism to determine whether the given pointer can be dereferenced directly or whether data must be explicitly copied from the device by invoking GPU library functions. Moreover, memory is referenced differently in different GPU programming models. For example, CUDA memory buffers are void pointers, but they cannot be dereferenced by the host unless Unified Virtual Addressing (UVA) is enabled. On the other hand, OpenCL memory buffers are represented as opaque cl_mem handles that internally translate to the physical device memory location but cannot be dereferenced by the host unless the buffer is mapped into the host’s address space or explicitly copied to the host.

Fig. 2: Overhead of runtime checks incurred by intranode CPU-CPU communication operations. The slowdown due to automatic detection (via cuPointerGetAttribute) can be up to 205% (average: 127.8%), while the slowdown for the datatype attribute check is at most 6.3% (average: 2.6%).

MVAPICH’s Automatic Detection Approach: MVAPICH allows MPI to deal with accelerator buffers by leveraging the UVA feature of CUDA to automatically detect device buffers. This method requires no modifications to the MPI interface. However, we have shown previously that while this approach works well for standalone point-to-point communication, programmers have to explicitly synchronize between interleaved and dependent MPI and CUDA operations, thereby requiring significant programmer effort to achieve ideal performance [26]. Moreover, as shown in Figure 2, the penalty for runtime checking can be significant and is incurred by all operations, including those that require no GPU data movement at all. Furthermore, the automatic detection approach is not extensible to other accelerator models, such as OpenCL, that do not map GPU buffers into the host virtual address space.
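For concreteness, the following is a minimal sketch of how a UVA-based pointer check can be written with the CUDA driver API. It illustrates the kind of per-operation test whose overhead Figure 2 quantifies; it is not MVAPICH's actual code, and the helper name is an illustrative assumption.

#include <stdint.h>
#include <cuda.h>

/* Sketch only: classify a user-supplied pointer as host or device memory
 * using UVA. This mirrors the per-call check measured in Figure 2; it is
 * not taken from MVAPICH. */
static int is_device_pointer(const void *buf)
{
    CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
    CUresult err = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)buf);
    /* Pointers unknown to CUDA (e.g., plain malloc'd host memory) make the
     * query fail; treat them as host buffers. */
    if (err != CUDA_SUCCESS)
        return 0;
    return mem_type == CU_MEMORYTYPE_DEVICE;
}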

MPI-ACC’s Datatype Attributes Approach: MPI datatypes are used to specify the type and layout of buffers passed to the MPI library. The MPI standard defines an interface for attaching metadata to MPI datatypes through datatype attributes. In MPI-ACC, we use these MPI datatype attributes to indicate the buffer type (e.g., CPU, CUDA, or OpenCL), the buffer locality (e.g., which GPU), the stream or event to synchronize upon, or any other information to the MPI library. With datatype attributes, there is no restriction on the amount of information that the user can pass to the MPI implementation. With our design, one can implicitly denote ordering of MPI and GPU operations by associating GPU events or streams with MPI calls, and the MPI-ACC implementation applies different heuristics to synchronize and make efficient communication progress. We have shown in our prior work [26] that this approach improves productivity and performance while remaining compatible with the MPI standard. Moreover, our approach introduces a lightweight runtime attribute check in each MPI operation, but the overhead is much lower than that of automatic detection, as shown in Figure 2. Since MPI-ACC supports both CUDA and OpenCL, and since OpenCL is compatible with several platforms and vendors, we consider MPI-ACC to be a more portable solution than MVAPICH. Table 1 summarizes the above differences.

TABLE 1: Comparison of MPI-ACC with MVAPICH.

                                        MPI-ACC                                      MVAPICH
GPU Buffer Identification               MPI datatype attributes                      void* (UVA-based)
MPI+GPU Synch. Method                   Implicit                                     Explicit
Performance/Progress for MPI+GPU Ops.   Automatic (through the MPI implementation)   Manual (through the programmer)
Software Platforms                      CUDA and OpenCL (any version)                CUDA v4.0 and newer only
Hardware Platforms                      AMD (CPU, GPU, and APU), NVIDIA GPU,         NVIDIA GPU only
                                        Intel (CPU and Xeon Phi), FPGA, etc.
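As an illustration of the datatype-attribute mechanism, the sketch below attaches a "buffer type" attribute to a datatype and then passes a device pointer to MPI_Send. The keyval machinery is standard MPI; the attribute value (BUF_TYPE_CUDA) and how the library interprets it are illustrative assumptions rather than MPI-ACC's exact attribute names.

#include <mpi.h>

/* Sketch only: mark a datatype as describing CUDA device memory via a
 * datatype attribute, then communicate a GPU buffer with plain MPI_Send.
 * The keyval calls are standard MPI; the attribute value and its
 * interpretation are illustrative, not MPI-ACC's exact names. */
enum buf_type { BUF_TYPE_CPU = 0, BUF_TYPE_CUDA = 1, BUF_TYPE_OPENCL = 2 };

void send_gpu_buffer(void *gpu_buf, int count, int dest, MPI_Comm comm)
{
    static int cuda_type = BUF_TYPE_CUDA;
    MPI_Datatype type;
    int keyval;

    MPI_Type_dup(MPI_CHAR, &type);               /* private datatype to annotate  */
    MPI_Type_create_keyval(MPI_TYPE_DUP_FN, MPI_TYPE_NULL_DELETE_FN,
                           &keyval, NULL);
    MPI_Type_set_attr(type, keyval, &cuda_type); /* "this buffer is on the GPU"   */

    MPI_Send(gpu_buf, count, type, dest, 0, comm);

    MPI_Type_free(&type);
    MPI_Type_free_keyval(&keyval);
}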

4.2 Optimizations

Once MPI-ACC has identified a device buffer, it leverages PCIe and network link parallelism to optimize the data transfer via pipelining. Pipelined data transfer parameters are dynamically selected based on NUMA and PCIe affinity to further improve communication performance.

4.2.1 Data Pipelining

We hide the PCIe latency between the CPU and GPU by dividing the data into smaller chunks and performing pipelined data transfers between the GPU, the CPU, and the network. To orchestrate the pipelined data movement, we create a temporary pool of host-side buffers that are registered with the GPU driver (CUDA or OpenCL) for faster DMA transfers. The buffer pool is created at MPI_Init time and destroyed during MPI_Finalize. The system administrator can choose to enable CUDA and/or OpenCL when configuring the MPICH installation. Depending on the choice of the GPU library, the buffer pool is created by calling either cudaMallocHost for CUDA or clCreateBuffer (with the CL_MEM_ALLOC_HOST_PTR flag) for OpenCL.
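A minimal sketch of such a pinned staging pool is shown below (CUDA path only); the pool depth and the use of a global array are illustrative choices, not MPI-ACC's internal constants.

#include <cuda_runtime.h>

#define PIPELINE_PKT_SIZE (256 * 1024)   /* pipeline packet size (Section 4.2.1) */
#define POOL_SLOTS        16             /* pool depth: illustrative choice      */

/* Sketch only: a fixed pool of pinned host staging buffers, allocated once
 * (e.g., at MPI_Init time) and reused for every pipelined transfer. The
 * OpenCL path would instead use clCreateBuffer with CL_MEM_ALLOC_HOST_PTR. */
static void *pinned_pool[POOL_SLOTS];

void create_staging_pool(void)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        cudaMallocHost(&pinned_pool[i], PIPELINE_PKT_SIZE);  /* page-locked, DMA-friendly */
}

void destroy_staging_pool(void)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        cudaFreeHost(pinned_pool[i]);
}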

To calculate the ideal pipeline packet size, we first individually measure the network and PCIe bandwidths at different data sizes (Figure 3) and then choose the packet size at the intersection point of the above channel rates, 64 KB for our experimental cluster (Section 6). If the performance at the intersection point is still latency bound for both data channels (network and PCIe), then we pick the pipeline packet size to be the size of the smallest packet at which the slower data channel reaches peak bandwidth. The end-to-end data transfer will then also work at the net peak bandwidth of the slower data channel. Also, only two packets are needed to do pipelining by double buffering: one channel receives the GPU packet to the host, while the other sends the previous GPU packet over the network. We therefore use two CUDA streams and two OpenCL command queues per device per MPI request to facilitate pipelining.

Fig. 3: Choosing the pipeline parameters: network – InfiniBand, transfer protocol – R3.

The basic pipeline loop for a “send” operation is as follows (“receive” works the same way, but the direction of the operations is reversed). Every time we prepare to send a packet over the network, we check for the completion of the previous GPU-to-CPU transfer by calling cudaStreamSynchronize or a loop of cudaStreamQuery for CUDA (or the corresponding OpenCL calls). However, we found that the GPU synchronization/query calls on already completed CPU-GPU copies caused a significant overhead in our experimental cluster, which hurt the effective network bandwidth and forced us to choose a different pipeline packet size. For example, we measured the cost of stream query/synchronization operations as approximately 20 µs, even though the data transfer had been completed. Moreover, this overhead occurs every time a packet is sent over the network, as shown in Figure 3 by the “Effective Network Bandwidth” line. We observed that the impact of the synchronization overhead is huge for smaller packet sizes but becomes negligible for larger packet sizes (2 MB). Also, we found no overlap between the PCIe bandwidth and the effective network bandwidth rates, and the PCIe was always faster for all packet sizes. Thus, we picked the smallest packet size that could achieve the peak effective network bandwidth (in our case, 256 KB) as the pipeline transfer size for MPI-ACC. Smaller packet sizes (<256 KB) caused the effective network bandwidth to be latency bound and were thus not chosen as the pipeline parameters. In MPI-ACC, we use the pipelining approach to transfer large messages—namely, messages that are at least as large as the chosen packet size—and fall back to the nonpipelined approach when transferring smaller messages.
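Putting these pieces together, the following is a minimal sketch of the double-buffered send pipeline described in this subsection, under simplified assumptions: two pinned staging buffers and two CUDA streams (taken from the staging pool), a fixed packet size, and the internal rendezvous protocol reduced to plain nonblocking sends.

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: double-buffered GPU-to-network "send" pipeline. While one
 * packet is copied from the GPU into a pinned staging buffer, the previous
 * packet is in flight on the network. stage[] and stream[] are assumed to
 * come from the staging pool; protocol details are simplified. */
void pipelined_gpu_send(const char *gpu_buf, size_t size, size_t pkt,
                        void *stage[2], cudaStream_t stream[2],
                        int dest, int tag, MPI_Comm comm)
{
    size_t nchunks = (size + pkt - 1) / pkt;
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (size_t i = 0; i < nchunks; i++) {
        int b = (int)(i % 2);                        /* alternate buffer/stream */
        size_t off = i * pkt;
        size_t len = (off + pkt <= size) ? pkt : size - off;

        MPI_Wait(&req[b], MPI_STATUS_IGNORE);        /* is staging buffer b free? */
        cudaMemcpyAsync(stage[b], gpu_buf + off, len,
                        cudaMemcpyDeviceToHost, stream[b]);
        cudaStreamSynchronize(stream[b]);            /* chunk i staged; the send of
                                                        chunk i-1 overlaps on the wire */
        MPI_Isend(stage[b], (int)len, MPI_CHAR, dest, tag, comm, &req[b]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}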

4.2.2 OpenCL Issues and Optimizations

In OpenCL, device data is encapsulated as a cl_mem object that is created by using a valid cl_context object. To transfer the data to/from the host, the programmer needs valid cl_device_id and cl_command_queue objects, which are all created by using the same context as the device data. At a minimum, the MPI interface for OpenCL communication requires the target OpenCL memory object, context, and device ID objects as parameters. The command queue parameter is optional and can be created by using the above parameters. Within the MPICH implementation, we either use the user-provided command queue or create several internal command queues for device-host data transfers. Within MPICH, we also create a temporary OpenCL buffer pool of pinned host-side memory for pipelining. However, OpenCL requires that the internal command queues and the pipeline buffers also be created by using the same context as the device data. Moreover, in theory, the OpenCL context could change for every MPI communication call, and so the internal OpenCL objects cannot be created at MPI_Init time. Instead, they must be created at the beginning of every MPI call and destroyed at the end of it.

The initialization of these temporary OpenCL objects is expensive, and their repeated usage severely hurts performance. We cache the command queue and pipeline buffer objects after the first communication call and reuse them if the same OpenCL context and device ID are used for the subsequent calls, which is a plausible scenario. If any future call involves a different context or device ID, we clear and replace our cache with the most recently used OpenCL objects. In this way, we can amortize the high OpenCL initialization cost across multiple calls and significantly improve performance. We use a caching window of one, which we consider to be sufficient in practice.

Fig. 4: Computational epidemiology simulation model (figure adapted from [9]). (a) Bipartite graph representation with people 1–4 and locations a–c. (b) Temporal and spatial people-to-people contacts. (c) Occupancy of location b overlaid with the health states of its occupants; the dotted area shows the time of possible disease transmission.
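A minimal sketch of this caching-window-of-one scheme for the internal command queue is given below; the cached pipeline buffers would be handled analogously, and error handling is omitted.

#include <CL/cl.h>

/* Sketch only: reuse the internal OpenCL command queue as long as the MPI
 * calls keep using the same context and device; otherwise evict and
 * recreate (caching window of one). */
static cl_context       cached_ctx   = NULL;
static cl_device_id     cached_dev   = NULL;
static cl_command_queue cached_queue = NULL;

cl_command_queue get_internal_queue(cl_context ctx, cl_device_id dev)
{
    if (ctx == cached_ctx && dev == cached_dev)
        return cached_queue;                   /* cache hit: amortize creation cost */

    if (cached_queue != NULL)
        clReleaseCommandQueue(cached_queue);   /* evict the previous entry */

    cached_ctx   = ctx;
    cached_dev   = dev;
    cached_queue = clCreateCommandQueue(ctx, dev, 0, NULL);
    return cached_queue;
}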

5 APPLICATION CASE STUDIES

In this section, we perform an in-depth analysis of the default MPI+GPU application design in scientific applications from computational epidemiology and seismology modeling. We identify the inherent data movement inefficiencies and show how MPI-ACC can be used to explore new design spaces and create novel application-specific optimizations.

5.1 EpiSimdemics

GPU-EpiSimdemics [9], [10] is a high-performance, agent-based simulation program for studying the spread of epidemics through large-scale social contact networks and the coevolution of disease, human behavior, and the social contact network. The participating entities in GPU-EpiSimdemics are persons and locations, which are represented as a bipartite graph (Figure 4a) and interact with one another iteratively over a predetermined number of iterations (or simulation days). The output of the simulation is the relevant disease statistics of the contagion diffusion, such as the total number of infected persons or an infection graph showing who infected whom and the time and location of the infection.

5.1.1 Phases

Each iteration of GPU-EpiSimdemics consists of two phases: computeVisits and computeInteractions. During the computeVisits phase, all the person objects of every processing element (or PE) first determine the schedules for the current day, namely, the locations to be visited and the duration of each visit. These visit messages are sent to the destination location’s host PE (Figure 4a). Computation of the schedules is overlapped with communication of the corresponding visit messages.

In the computeInteractions phase, each PE first groups the received visit messages by their target locations. Next, each PE computes the probability of infection transmission between every pair of spatially and temporally colocated people in its local location objects (Figure 4b), which determines the overall disease spread information of that location. The infection transmission function depends on the current health states (e.g., susceptible, infectious, latent) of the people involved in the interaction (Figure 4c) and the transmissibility factor of the disease. These infection messages are sent back to the “home” PEs of the infected persons. Each PE, upon receiving its infection messages, updates the health states of the infected individuals, which will influence their schedules for the following simulation day. Thus, the messages that are computed as the output of one phase are transferred to the appropriate PEs as inputs of the next phase of the simulation. The system is synchronized by barriers after each simulation phase.

5.1.2 Computation-Communication Patterns and MPI-ACC–Driven Optimizations

In GPU-EpiSimdemics, each PE in the simulation is implemented as a separate MPI process. Also, the computeInteractions phase of GPU-EpiSimdemics is offloaded and accelerated on the GPU, while the rest of the computations are executed on the CPU [10]. The current implementation of GPU-EpiSimdemics assumes a one-to-one mapping of GPUs to MPI processes. In accordance with the GPU-EpiSimdemics algorithm, the output data elements from the computeVisits phase (i.e., visit messages) are first received over the network, then merged, grouped, and preprocessed before the GPU can begin the computeInteractions phase of GPU-EpiSimdemics. Moreover, there are two GPU computation modes depending on how the visit messages are processed on the GPUs. In this paper, we discuss the exclusive GPU computation mode; a discussion of the cooperative CPU-GPU computation mode can be found in our prior work [27].

Fig. 5: Exclusive GPU computation mode of GPU-EpiSimdemics. Left: manual MPI+CUDA design. Right: MPI-ACC–enabled design, where the visit messages are transferred and preprocessed on the device in a pipelined manner.

Preprocessing phase on the GPU: As a preprocessing step in the computeInteractions phase, we modify the data layout of the visit messages to be more amenable to the massively parallel architecture of the GPU [10]. Specifically, we unpack the visit message structures into a 2D time-bin matrix, where each row of the matrix represents a person-location pair and the cells in the row represent fixed time slots of the day; that is, each visit message corresponds to a single row in the person-timeline matrix. Depending on the start time and duration of a person’s visit to a location, the corresponding row cells are marked as visited. The preprocessing logic of data unpacking is implemented as a separate GPU kernel at the beginning of the computeInteractions phase. The matrix data representation enables much better SIMDization of the computeInteractions code execution, which significantly improves the GPU performance. However, we achieve these benefits at the cost of a larger memory footprint for the person-timeline matrix, as well as a computational overhead for the data unpacking.
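A simplified version of such an unpacking kernel is sketched below; the visit layout, the number of time slots, and the 0/1 cell encoding are illustrative assumptions, not GPU-EpiSimdemics' actual data structures.

#define SLOTS_PER_DAY 96   /* e.g., 15-minute time bins; illustrative value */

typedef struct { int start_slot; int num_slots; } visit_t;

/* Sketch only: one thread per visit message marks the occupied time slots
 * in its row of the person-timeline matrix. */
__global__ void unpack_visits(const visit_t *visits, int num_visits,
                              unsigned char *timeline /* num_visits x SLOTS_PER_DAY */)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_visits)
        return;

    unsigned char *row = timeline + (size_t)v * SLOTS_PER_DAY;
    int end = visits[v].start_slot + visits[v].num_slots;
    if (end > SLOTS_PER_DAY)
        end = SLOTS_PER_DAY;
    for (int s = visits[v].start_slot; s < end; s++)
        row[s] = 1;   /* this person-location pair is present in slot s */
}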

Basic MPI+GPU communication-computation pipeline: In the naïve data movement approach, in each PE, we receive the visit messages in the main memory during the computeVisits phase, transfer the aggregate data to the local GPU (device) memory across the PCIe bus, and then begin the preprocessing step of the computeInteractions phase. While all PEs communicate the visit messages with every other PE, the number of pairwise visit message exchanges is not known beforehand; in other words, it is not a typical collective all-to-all or scatter/gather operation. On the other hand, each PE preallocates CPU buffer fragments and registers them as persistent receive requests with the MPI library by using the MPI_Recv_init call to enable persistent point-to-point communication among the PEs. Moreover, each PE has to create persistent CPU receive buffers corresponding to every other participating PE in the simulation; that is, the receive buffers increase with the number of PEs and cannot be reused from a constant buffer pool.

Whenever a buffer fragment is received into the corresponding receive buffer, it is copied into a contiguous visit vector in the CPU’s main memory. The computeInteractions phase of the simulation then copies the aggregated visit vector to the GPU memory for preprocessing. While the CPU-CPU communication of visit messages is somewhat overlapped with their computation on the source CPUs, the GPU and the PCIe bus will remain idle until the visit messages are completely received, merged, and ready to be transferred to the GPU.

Advanced MPI+GPU communication-computation pipeline: In this optimization, we still preallocate persistent receive buffers on the CPU in each PE and register them with the MPI library as the communication endpoints by calling MPI_Recv_init. But we create the contiguous visit vector in GPU memory, so that whenever a PE receives a visit buffer fragment on the CPU, we immediately enqueue an asynchronous CPU-GPU data transfer to the contiguous visit vector and also enqueue the associated GPU preprocessing kernels, thereby manually creating the communication-computation pipeline.
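A sketch of this manual pipeline is shown below; the fragment bookkeeping is simplified, and names such as frag, frag_size, and the preprocessing launcher are illustrative assumptions rather than GPU-EpiSimdemics' exact code.

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: persistent receives land in pinned host fragments; each
 * completed fragment is pushed asynchronously into the device-resident
 * visit vector and its preprocessing kernel is enqueued. Names and the
 * fixed fragment size are illustrative. */
void receive_visits_manual(int num_pes, size_t frag_size,
                           char **frag,          /* pinned host fragments, one per PE  */
                           char *d_visit_vec,    /* contiguous visit vector on the GPU */
                           cudaStream_t *stream,
                           void (*launch_preprocess)(char *, size_t, cudaStream_t))
{
    MPI_Request *req = malloc(num_pes * sizeof(MPI_Request));

    for (int p = 0; p < num_pes; p++)            /* persistent communication endpoints */
        MPI_Recv_init(frag[p], (int)frag_size, MPI_BYTE, p, 0,
                      MPI_COMM_WORLD, &req[p]);
    MPI_Startall(num_pes, req);

    for (int done = 0; done < num_pes; done++) {
        int p;
        MPI_Waitany(num_pes, req, &p, MPI_STATUS_IGNORE);   /* fragment from PE p arrived */
        cudaMemcpyAsync(d_visit_vec + (size_t)p * frag_size, frag[p], frag_size,
                        cudaMemcpyHostToDevice, stream[p]); /* requires pinned frag[p]   */
        launch_preprocess(d_visit_vec + (size_t)p * frag_size, frag_size, stream[p]);
    }

    for (int p = 0; p < num_pes; p++)
        MPI_Request_free(&req[p]);
    free(req);
}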

In order to enable asynchronous CPU-GPU data transfers, however, the persistent receive buffers must be non-pageable (pinned) memory. Also, since the number of receive buffers increases with the number of PEs, the pinned memory footprint also increases with the number of PEs. This design reduces the available pageable CPU memory, a situation that could lead to poor CPU performance [6].

The pinned memory management logic can be implemented at the application level in a couple of ways. In the first approach, the pinned memory pool is created before the computeVisits phase begins and is destroyed once the phase finishes, but the memory management routines are invoked every simulation iteration. While this approach is relatively simple to implement, repeated memory management leads to significant performance overhead. In the second approach, the pinned memory pool is created once, before the main simulation loop, and is destroyed after the loop ends, thus avoiding the performance overhead of repeated memory management. However, this design reduces the available pageable CPU memory, not only for the computeVisits phase but also for the other phases of the simulation, including computeInteractions. We discuss the performance tradeoffs of these two memory management techniques in Section 6.2.

MPI-ACC–enabled communication-computation pipeline: Since MPI-ACC handles both CPU and GPU buffers, in each PE we preallocate persistent buffer fragments on the GPU and register them with the MPI library by using MPI_Recv_init. Without MPI-ACC’s GPU support, one cannot create persistent buffers on the GPU as MPI communication endpoints. In this approach, the receive buffers on the GPU increase with the number of PEs, but the approach does not require a growing pinned memory pool on the host. Furthermore, MPI-ACC internally creates a constant pool of pinned memory during MPI_Init and automatically reuses it for all pipelined data communication, thereby providing better programmability and application scalability. Internally, MPI-ACC may either pipeline the internode CPU-GPU data transfers via the host CPU’s memory or use direct GPU transfer techniques (e.g., GPUDirect RDMA), if possible; but these details are hidden from the programmer.

Along with the preallocated persistent GPU buffer fragments, the contiguous visit vector is created in the GPU memory itself. As soon as a PE receives the visit buffer fragments on the GPU, we enqueue kernels to copy data within the device to the contiguous visit vector and also enqueue the associated GPU preprocessing kernels, thereby creating the end-to-end communication-computation pipeline. Thus, we completely overlap the visit data generation on the CPU with the internode CPU-GPU data transfers and the GPU preprocessing. In this way, the preprocessing overhead is completely hidden and removed from the computeInteractions phase. Moreover, the CPU, the GPU, and the interconnection networks are all kept busy, performing either data transfers or the preprocessing execution.
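Under these assumptions, the MPI-ACC variant of the same receive loop reduces to the sketch below: the persistent endpoints are fragments of the device-resident visit vector (their GPU residency conveyed through the datatype attributes of Section 4.1), and no application-level pinned pool or explicit cudaMemcpyAsync is needed. Names and the fixed fragment size remain illustrative.

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: persistent receives posted directly on GPU buffer fragments;
 * gpu_type is assumed to be an MPI datatype annotated with CUDA attributes
 * (Section 4.1). MPI-ACC pipelines the transfers internally. */
void receive_visits_mpiacc(int num_pes, int frag_count, size_t frag_bytes,
                           MPI_Datatype gpu_type,
                           char *d_visit_vec,     /* contiguous visit vector on the GPU */
                           cudaStream_t *stream,
                           void (*launch_preprocess)(char *, size_t, cudaStream_t))
{
    MPI_Request *req = malloc(num_pes * sizeof(MPI_Request));

    for (int p = 0; p < num_pes; p++)
        MPI_Recv_init(d_visit_vec + (size_t)p * frag_bytes, frag_count, gpu_type,
                      p, 0, MPI_COMM_WORLD, &req[p]);
    MPI_Startall(num_pes, req);

    for (int done = 0; done < num_pes; done++) {
        int p;
        MPI_Waitany(num_pes, req, &p, MPI_STATUS_IGNORE);   /* data is already on the GPU */
        launch_preprocess(d_visit_vec + (size_t)p * frag_bytes, frag_bytes, stream[p]);
    }

    for (int p = 0; p < num_pes; p++)
        MPI_Request_free(&req[p]);
    free(req);
}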

MPI-ACC’s internal pipelined CPU-GPU data transfer largely hides the PCIe transfer latency during the computeVisits phase. However, it still adds a non-negligible cost to the overall communication time when compared with the CPU-CPU data transfers of the default MPI+GPU implementation. Nevertheless, our experimental results show that the gains achieved in the computeInteractions phase due to the preprocessing overlap outweigh the communication overheads of the computeVisits phase for all system configurations and input data sizes.

In summary, MPI-ACC helps the programmer focus on the high-level application logic by enabling automatic and efficient low-level memory management techniques. Moreover, MPI-ACC exposes a natural interface to communicate with the target device (CPU or GPU), without treating CPUs as explicit communication relays.

5.2 FDM-Seismology

FDM-Seismology is our MPI+GPU hybrid implementation of an application that models the propagation of seismological waves using the finite-difference method, taking the Earth’s velocity structures and seismic source models as input [11]. The application implements a parallel velocity-stress, staggered-grid finite-difference method for propagation of waves in a layered medium. In this method, the domain is divided into a three-dimensional grid, and a one-point-integration scheme is used for each grid cell. Since the computational domain is truncated in order to keep the computation tractable, absorbing boundary conditions (ABCs) are placed around the region of interest so as to keep the reflections minimal when the boundaries are impinged by the outgoing waves. This strategy helps simulate unbounded domains. In our application, PML (perfectly matched layers) absorbers [28] are used as ABCs for their superior efficiency and minimal reflection coefficient. The use of a one-point integration scheme leads to an easy and efficient implementation of the PML absorbing boundaries and allows the use of irregular elements in the PML region [11].

5.2.1 Computation-Communication Patterns

The simulation operates on the input finite-difference (FD) model and generates a three-dimensional grid as a first step. Our MPI-based parallel version of the application divides the input FD model into submodels along different axes such that each submodel can be computed on different CPUs (or nodes). This domain decomposition technique helps the application scale to a large number of nodes. Each processor computes the velocity and stress wavefields in its own subdomain and then exchanges the wavefields with the nodes operating on neighboring subdomains, after each set of velocity or stress computations (Figure 6). Each processor updates its own wavefields after receiving all its neighbors’ wavefields.

These computations are run multiple times for better accuracy and convergence of results. In every iteration, each node computes the velocity components followed by the stress components of the seismic wave propagation. The wavefield exchanges with neighbors take place after each set of velocity and stress computations. This MPI communication takes place in multiple stages, wherein each communication is followed by an update of local wavefields and a small postcommunication computation on local wavefields. At the end of each iteration, the updated local wavefields are written to a file.

Fig. 6: Communication-computation pattern in the FDM-Seismology application. Left: basic MPI+GPU execution mode with data marshaling on the CPU. Right: execution modes with data marshaling on the GPU. MPI-ACC automatically communicates the GPU data; the MPI+GPU Adv case explicitly stages the communication via the CPU.

The velocity and stress wavefields are stored as large multidimensional arrays on each node. In order to optimize the MPI communication between neighbors of the FD domain grid, only a few elements of the wavefields, those needed by the neighboring node for its own local update, are communicated to the neighbor, rather than whole arrays. Hence, each MPI communication is surrounded by data-marshaling steps, where the required elements are packed into a smaller array at the source, communicated, and then unpacked at the receiver in order to update its local data.

5.2.2 GPU Acceleration of FDM-Seismology

We describe a couple of GPU execution modes of FDM-Seismology.

MPI+GPU with data marshaling on CPU (MPI+GPU): Our GPU-accelerated version of FDM-Seismology performs the velocity and stress computations as GPU kernels. In order to transfer the wavefields to other nodes, it first copies the bulk data from the GPU buffers to CPU memory over the PCIe bus and then transfers the individual wavefields over MPI to the neighboring nodes (Figure 6). All the data-marshaling operations and small postcommunication computations are performed on the CPU itself. The newly updated local wavefields that are received over MPI are then bulk transferred back to the GPU before the start of the next stress or velocity computation on the GPU.

MPI+GPU with data marshaling on GPU (MPI+GPU Adv): In this execution mode, the data-marshaling operations are moved to the GPU to leverage the faster GDDR5 memory module and the massively parallel GPU architecture. Consequently, the CPU-GPU bulk data transfers before and after each velocity-stress computation kernel are completely avoided. The need to explicitly bulk transfer data from the GPU to the CPU arises only at the end of the iteration, when the results are transferred to the CPU to be written to a file (Figure 6).

5.2.3 MPI-ACC–Enabled Optimizations

GPU-based data marshaling suffers from the following disadvantage in the absence of GPU-integrated MPI. All data-marshaling steps are separated by MPI communication, and each data-marshaling step depends on the previously marshaled data and the received MPI data from the neighbors. In other words, after each data-marshaling step, data has to be explicitly moved from the GPU to the CPU only for MPI communication. Similarly, the received MPI data has to be explicitly moved back to the GPU before the next marshaling step. In this scenario, the application uses the CPU only as a communication relay. If the GPU communication technology changes (e.g., GPUDirect RDMA), we will have to largely rewrite the FDM-Seismology communication code to achieve the expected performance.

With MPI-ACC as the communication library, we still perform data marshaling on the GPU, but we communicate the marshaled data directly to and from the GPU without explicitly using the CPU for data staging. The bulk transfer of data still happens only once, at the end of each iteration, in order to write the results to a file. The data-marshaling step, however, happens multiple times during a single iteration, and consequently the application launches a series of GPU kernels. While consecutive kernels entail launch and synchronization overhead per kernel invocation, the benefits of faster data marshaling on the GPU and optimized MPI communication outweigh the kernel overheads.
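A sketch of one such wavefield exchange is given below; the pack/unpack launchers, the neighbor handling, and gpu_type (a datatype assumed to carry the CUDA attributes of Section 4.1) are illustrative assumptions rather than the application's exact code.

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: one wavefield exchange with GPU-side marshaling and direct
 * MPI on device buffers. pack_on_gpu/unpack_on_gpu are assumed wrappers
 * that launch the marshaling and update kernels for neighbor n. */
void exchange_wavefields(char *d_send, char *d_recv, int count,
                         MPI_Datatype gpu_type, const int *neighbor,
                         int num_neighbors, MPI_Comm comm,
                         void (*pack_on_gpu)(int), void (*unpack_on_gpu)(int))
{
    MPI_Request req[2];

    for (int n = 0; n < num_neighbors; n++) {
        pack_on_gpu(n);                 /* marshal boundary elements on the GPU   */
        cudaDeviceSynchronize();        /* packed device buffer ready for MPI-ACC */

        MPI_Irecv(d_recv, count, gpu_type, neighbor[n], 0, comm, &req[0]);
        MPI_Isend(d_send, count, gpu_type, neighbor[n], 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        unpack_on_gpu(n);               /* post-communication update on the GPU   */
    }
}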

Beyond the benefits resulting from GPU-driven data marshaling, a GPU-integrated MPI library benefits the FDM-Seismology application in the following ways: (1) it significantly enhances the productivity of the programmer, who is no longer constrained by the fixed CPU-only MPI communication and can easily choose the appropriate device as the communication target endpoint; (2) the pipelined data transfers within MPI-ACC further improve the communication performance over the network; and (3) regardless of the GPU communication technology that may become available in the future, our MPI-ACC–driven FDM-Seismology code will not change and will automatically benefit from the performance upgrades made available by subsequent GPU-integrated MPI implementations (e.g., support for GPUDirect RDMA).

6 EVALUATION

In this section, we describe our experimental setup, followed by the performance evaluation of MPI-ACC via latency microbenchmarks. Next, we demonstrate the efficacy of the MPI-ACC–enabled optimizations in GPU-EpiSimdemics and FDM-Seismology. Using both microbenchmarks and GPU-EpiSimdemics, we discuss the impact of shared-resource (hardware and software) contention on MPI-ACC’s communication performance.

We conducted our experiments on HokieSpeed, a state-of-the-art, 212-teraflop hybrid CPU-GPU supercomputerhoused at Virginia Tech. Each HokieSpeed node containstwo hex-core Intel Xeon E5645 CPUs running at 2.40 GHzand two NVIDIA Tesla M2050 GPUs. The host memory


Fig. 7: Internode communication latency for GPU-to-GPU (CUDA) data transfers, InfiniBand transfer protocol: R3. Similar performance is observed for OpenCL data transfers. The chosen pipeline packet size for MPI-ACC is 256 KB.

The host memory capacity is 24 GB, and each GPU has a 3 GB device memory. The internode interconnect is QDR InfiniBand. We used up to 128 HokieSpeed nodes and both GPUs per node for our experiments. We used the GCC v4.4.7 compiler and CUDA v5.0 with driver version 310.19.

6.1 Microbenchmark Analysis

Impact of Pipelined Data Transfer: In Figure 7, we compare the performance of MPI-ACC with the manual blocking and manual pipelined implementations. Our internode GPU-to-GPU latency tests show that MPI-ACC is better than the manual blocking approach by up to 48.3% and is up to 18.2% better than the manual pipelined implementation, especially for larger data transfers. The manual pipelined implementation repeatedly invokes MPI calls, causing multiple handshake messages to be sent back and forth across the network and thus hurting performance. On the other hand, we perform the handshake only once in MPI-ACC to establish the sender-receiver identities, followed by low-overhead pipelining. We perform pipelining in MPI-ACC only for messages that are larger than the pipeline packet size, and we fall back to the default blocking approach for smaller data sizes. Hence, the performance of MPI-ACC is comparable to the manual blocking approach for smaller message sizes.
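For reference, a minimal sketch of the "manual pipelined" sender used as a comparison point follows. It chunks the device buffer, copies each chunk to pinned host memory, and sends each chunk with its own MPI call, so the handshake cost repeats per chunk. The packet size matches the 256 KB used in the experiments; the helper names and the double-buffering structure are illustrative assumptions rather than the exact benchmark code.

/* Sketch of a manually pipelined GPU-to-CPU-to-network send. */
#include <mpi.h>
#include <cuda_runtime.h>

#define PKT (256 * 1024)   /* pipeline packet size (bytes) */

void send_pipelined(const char *d_buf, size_t nbytes, int dest, MPI_Comm comm)
{
    char *h_stage[2];
    cudaStream_t stream[2];
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int i = 0; i < 2; i++) {
        cudaMallocHost((void **)&h_stage[i], PKT);   /* pinned staging */
        cudaStreamCreate(&stream[i]);
    }

    int b = 0;
    for (size_t off = 0; off < nbytes; off += PKT, b ^= 1) {
        size_t len = (nbytes - off < PKT) ? nbytes - off : PKT;
        /* Reuse this staging buffer only after its previous send is done. */
        MPI_Wait(&req[b], MPI_STATUS_IGNORE);
        cudaMemcpyAsync(h_stage[b], d_buf + off, len,
                        cudaMemcpyDeviceToHost, stream[b]);
        cudaStreamSynchronize(stream[b]);
        /* Each chunk is a separate MPI message, so the handshake cost
         * is paid once per chunk. */
        MPI_Isend(h_stage[b], (int)len, MPI_BYTE, dest, 0, comm, &req[b]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    for (int i = 0; i < 2; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_stage[i]);
    }
}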

Impact of OpenCL Object Caching: Our OpenCL caching optimization improves the internode GPU-to-GPU communication latency from 3% for larger data sizes (64 MB) to 88.7% for smaller data sizes (< 256 KB). Even where the programmers provide their custom command queue, the pipeline buffers still have to be created for every MPI communication call; hence, caching improves performance.

6.2 Case Study Analysis: EpiSimdemics

We compare the combined performance of all the phases of GPU-EpiSimdemics (computeVisits and computeInteractions), with and without the MPI-ACC–driven optimizations discussed in Section 5.1.2. We choose different-sized input datasets of synthetic populations from two U.S. states: Washington (WA), with a population of 5.7 million, and California (CA), with a population of 33.1 million.

Fig. 8: Execution profile of GPU-EpiSimdemics over various node configurations.

In this paper, we present results and detailed analysis of WA; the analysis of CA is described in our prior work [27]. We also vary the number of compute nodes from 8 to 128 and the number of GPU devices between 1 and 2. We begin from the smallest node-GPU configuration that can fit the entire problem in the available GPU memory. We also compare the application performance when MVAPICH (v2.0a) is used as the GPU-aware MPI implementation of choice.

Our results in Figure 8 indicate that our MPI-ACC–driven optimizations perform better than the basic blocking MPI+GPU implementations by an average of 9.2% and by up to 13.3% for WA. The performance of the MPI-ACC–driven solution is similar to the performance of the MVAPICH-based and the manual MPI+GPU (advanced) implementations. Since the CPU-GPU transfer is not a bottleneck in GPU-EpiSimdemics, the specific data pipelining logic of either MPI-ACC or MVAPICH does not directly affect the performance gains. On the other hand, the preprocessing step (data unpacking) of the computeInteractions phase is completely overlapped with the asynchronous CPU-to-remote-GPU communication for all node configurations. Note that the advanced MPI+GPU implementation uses the manual pinned memory management techniques that we implemented at the application level, which achieve better performance but with much more complex code.

For larger node configurations, the local operating dataset in the computeInteractions phase becomes smaller, and hence the basic MPI+GPU solution takes less time to execute the preprocessing stage; in other words, the absolute gains achieved by hiding the preprocessing step diminish for GPU-EpiSimdemics. However, we have shown that MPI-ACC can enable the developer to create new optimizations for better latency hiding and resource utilization.

Data management complexity vs. performance tradeoffs: While the advanced MPI+GPU implementation achieved performance comparable to the MPI-ACC–based solution, it put the burden of explicit data management on the application programmer. As discussed in Section 5.1.2, the user can instead write simpler code and avoid explicit data management, but then has to repeatedly create and destroy the receive buffers for every simulation iteration, thereby losing performance. Figure 9 shows the performance tradeoffs of the two approaches.


Fig. 9: Analysis of the data management complexity vs. performance tradeoffs. Manual data management achieves better performance at the cost of high code complexity. The case with no explicit data management has simpler code but performs poorly.

We observe that explicit data management is better for all node configurations and can achieve up to a 4.5× performance improvement. Without data management, the pinned memory footprint of the receive buffers increases with the number of MPI processes, thereby entailing bigger performance losses at larger node counts. To quantify the degree of performance loss, we measured the individual memory allocation costs using simple microbenchmarks and found that CUDA's pinned memory allocator (cudaMallocHost) was about 26% slower than the vanilla CPU memory allocator (malloc) for single CUDA contexts. We also observed that the pinned memory allocation cost increased linearly with the number of GPUs or CUDA contexts, whereas memory management across multiple processes and CUDA contexts should ideally be handled independently in parallel. Consequently, in Figure 9, we see that for the same number of MPI processes, the node configuration with two MPI processes (or GPUs) per node performs worse than the one with a single MPI process per node; for example, the 64×2 configuration is slower than the 128×1 one. Thus, efficient pinned memory management is essential for superior performance, and MPI-ACC provides it automatically to the programmers.

Discussion: The basic MPI+GPU solution has preprocessing overhead but does not have significant memory management issues. While the advanced MPI+GPU implementation gains from hiding the preprocessing overhead, it loses from either nonscalable pinned memory management or poor programmer productivity. On the other hand, MPI-ACC provides a more scalable solution by (1) automatically managing a fixed-size pinned buffer pool for pipelining and (2) creating the buffer pools just once at MPI_Init and destroying them at MPI_Finalize. MPI-ACC thus gains from both hiding the preprocessing overhead and efficient pinned memory management. MPI-ACC decouples the lower-level memory management logic from the high-level simulation implementation, thereby enabling both performance and productivity.
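The following minimal sketch illustrates the fixed-size pinned buffer pool idea described above: the pool is allocated once (e.g., during MPI_Init) and slots are checked out per transfer, so cudaMallocHost never appears on the communication critical path. Pool size, slot size, and helper names are illustrative assumptions, not MPI-ACC's actual internals.

/* Sketch of a pinned staging-buffer pool allocated once and reused. */
#include <cuda_runtime.h>
#include <stddef.h>

#define POOL_SLOTS 8
#define SLOT_BYTES (256 * 1024)

static void *pool[POOL_SLOTS];
static int   in_use[POOL_SLOTS];

void pool_init(void)                  /* called once, e.g., at MPI_Init */
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        cudaMallocHost(&pool[i], SLOT_BYTES);   /* pinned host memory */
        in_use[i] = 0;
    }
}

void *pool_get(void)                  /* grab a free pinned slot */
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (!in_use[i]) { in_use[i] = 1; return pool[i]; }
    return NULL;                      /* caller waits or falls back */
}

void pool_put(void *p)                /* return the slot after the transfer */
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (pool[i] == p) { in_use[i] = 0; return; }
}

void pool_finalize(void)              /* called once, e.g., at MPI_Finalize */
{
    for (int i = 0; i < POOL_SLOTS; i++)
        cudaFreeHost(pool[i]);
}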

6.3 Case Study Analysis: FDM-Seismology

In this section, we analyze the performance of the different phases of the FDM-Seismology application and evaluate the effect of MPI-ACC and MVAPICH on the application.

Fig. 10: Analyzing the FDM-Seismology application with the larger input data (Dataset-2). Note that MPI Communication refers to CPU-CPU data transfers for the MPI+GPU and MPI+GPU Adv cases and GPU-GPU (pipelined) data transfers for MPI-ACC and MVAPICH. The performance difference between MPI-ACC (CUDA) and MPI-ACC (OpenCL) is negligible, and only the MPI-ACC (CUDA) result is shown in this figure.

FDM-Seismology is implemented with both CUDA and OpenCL; while MPI-ACC is evaluated using both CUDA and OpenCL, MVAPICH can be evaluated only with the CUDA-based code. We vary the number of nodes from 2 to 128 with 1 GPU per node and use small and large datasets as input. Our scalability experiments begin from the smallest number of nodes required to fit the given data in the device memory. For the larger input data (i.e., Dataset-2), the size of the MPI transfers increases by 2×, while the size of data to be marshaled increases by 4× when compared with the smaller Dataset-1.

Figure 10 shows the performance of the FDM-Seismology application, with and without the GPU-based data marshaling. We report the average wall-clock time across all the processes because the computation-communication costs vary depending on the virtual location of the process in the application's structured grid representation. The application's running time is composed mainly of velocity and stress computations (>60%) and does not change for the three application designs.

In the basic MPI+GPU case, we perform both the data-marshaling operations and MPI communication from the CPU. Thus, the application has to move large wavefield data between the CPU and the GPU for data marshaling and MPI communication after every stress and velocity computation phase in every iteration. In the MPI+GPU Adv, MVAPICH, and MPI-ACC–driven scenarios, we perform data marshaling on the GPU itself; hence, smaller-sized wavefield data is transferred from the GPU to the CPU only once per iteration for output generation. By performing data marshaling on the GPU, we avoid the large bulk CPU-GPU data transfers and improve the overall performance over the basic MPI+GPU design with data marshaling on the CPU. Data marshaling on the GPU provides the performance gains, while MPI-ACC improves programmer productivity by directly communicating the GPU buffers (CUDA/OpenCL) in the application.

Scalability analysis: Figure 11 shows the performance improvement due to the MPI-ACC–enabled GPU data marshaling strategy over the basic MPI+GPU implementation with CPU data marshaling.


Fig. 11: Scalability analysis of the FDM-Seismology application with two datasets of different sizes. The baseline for speedup is the naïve MPI+GPU programming model with CPU data marshaling.

We see that the performance benefits due to GPU data marshaling decrease with an increasing number of nodes, for the following reasons. For a given dataset, the per-node data size decreases with an increasing number of nodes. Thus, the costly CPU-GPU bulk data transfers are reduced (Figure 10), and the overall benefits of GPU-based data marshaling itself are minimized. Also, for a larger number of nodes, the application's MPI communication cost becomes significant when compared with the computation and data-marshaling costs. In such a scenario, the CPU-to-CPU MPI communication of the MPI+GPU and MPI+GPU Adv implementations has less overhead than does the pipelined GPU-to-GPU MPI communication of the MPI-ACC–enabled design. If newer technologies such as GPUDirect RDMA are integrated into MPI, we can expect the GPU-to-GPU communication overhead to be reduced, but the overall benefits of GPU data marshaling itself will still be limited because of the reduced per-process working set.

7 ANALYSIS OF CONTENTION

In this section, we provide some insights into the scalable design of MPI-ACC. Specifically, we show that MPI-ACC is designed to work concurrently with other existing GPU workloads with minimum contention; that is, one should be able to perform MPI-ACC GPU-GPU communication and other user-specified GPU tasks (kernels or data transfers) simultaneously with minimum performance degradation for both tasks. We analyze the contention effects of MPI-ACC, MVAPICH, and manual MPI+GPU on concurrent GPU and PCIe workloads.

Sources of contention: NVIDIA Fermi GPUs have one hardware queue each for enqueueing GPU kernels, D2H data transfers, and H2D data transfers. Operations on different hardware queues can potentially overlap; for example, a GPU kernel can overlap with H2D and D2H transfers simultaneously. However, operations enqueued to the same hardware queue are processed serially. If a GPU task oversubscribes a hardware queue by aggressively enqueueing multiple operations of the same type, it can severely slow other GPU tasks contending to use the same hardware queue.

GPU streams are software workflow abstractions for a sequence of operations that execute in issue order on the GPU. Stream operations are directed to the appropriate hardware queue depending on the operation type. Operations from different streams can execute concurrently and may be interleaved, whereas operations within the same stream are processed serially, leading to software contention.
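As a concrete illustration of the queue/stream behavior described above, the sketch below issues an H2D copy, a kernel, and a D2H copy on three separate streams; because the three operations target distinct hardware queues and distinct buffers, they may overlap, whereas issuing them on one stream would serialize them. The kernel and buffer names are illustrative, and the host buffers are assumed to be pinned for the copies to be truly asynchronous.

/* Sketch of overlapping H2D, kernel, and D2H work via separate streams. */
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                 /* placeholder computation */
}

void overlap_example(float *d_in, float *d_work, float *d_out,
                     float *h_in, float *h_out, int n)
{
    cudaStream_t s_h2d, s_kernel, s_d2h;
    cudaStreamCreate(&s_h2d);
    cudaStreamCreate(&s_kernel);
    cudaStreamCreate(&s_d2h);

    /* Three independent operations on three streams map to the H2D copy
     * engine, the compute engine, and the D2H copy engine, and can
     * therefore execute concurrently. */
    cudaMemcpyAsync(d_in, h_in, (size_t)n * sizeof(float),
                    cudaMemcpyHostToDevice, s_h2d);
    scale<<<(n + 255) / 256, 256, 0, s_kernel>>>(d_work, n);
    cudaMemcpyAsync(h_out, d_out, (size_t)n * sizeof(float),
                    cudaMemcpyDeviceToHost, s_d2h);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s_h2d);
    cudaStreamDestroy(s_kernel);
    cudaStreamDestroy(s_d2h);
}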

Fig. 12: Characterizing the contention impact of CUDA's stream-0 in GPU-EpiSimdemics.

In summary, contention among GPU operations can be of two types: hardware contention, where one or more hardware queues of the GPU are oversubscribed by the same type of operation, or software contention, where different types of operations are issued to the same GPU stream. In MPI-ACC, we have carefully minimized both types of contention by staggered enqueueing of H2D and D2H operations to different GPU streams, thereby enabling maximum concurrency.

Microbenchmark design: We extended the SHOC benchmark suite's contention-mt application for the microbenchmark study. The benchmark creates two MPI processes, each on a separate node and controlling the two local GPUs. Each MPI process is also dual-threaded and concurrently runs one task per thread: task-0 on thread-0 performs point-to-point GPU-to-GPU MPI communication with the other process, and task-1 on thread-1 executes local non-MPI GPU tasks, such as compute kernels or H2D and D2H data transfer loops. CUDA allows the same GPU context to be shared among all the threads (tasks) in a process, and we share the local GPU between both tasks. To measure the contention impact, we first execute tasks 0 and 1 independently without contention and then execute them concurrently to induce contention. Each task is run for 100 loop iterations for both the independent and concurrent runs. We measure and report the performance difference between the tasks' independent and concurrent runs as the incurred contention.
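A skeleton of the benchmark structure just described is shown below: two threads per MPI process share one GPU, with thread 0 doing GPU-to-GPU MPI transfers and thread 1 running local GPU work. The skeleton assumes an MPI_THREAD_MULTIPLE environment; the function names, and the omitted bodies, are illustrative placeholders rather than the extended contention-mt code.

/* Skeleton of the dual-threaded contention microbenchmark. */
#include <mpi.h>
#include <pthread.h>
#include <cuda_runtime.h>

#define ITERS 100

static void *task0_mpi(void *arg)        /* GPU-to-GPU MPI communication */
{
    /* ... MPI_Send/MPI_Recv of a device buffer with the peer process,
     *     ITERS times, timing each iteration ... */
    return NULL;
}

static void *task1_local(void *arg)      /* local non-MPI GPU work */
{
    /* ... kernel launches or H2D/D2H cudaMemcpyAsync loops on a custom
     *     stream, ITERS times, timing each iteration ... */
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    pthread_t t0, t1;
    pthread_create(&t0, NULL, task0_mpi, NULL);     /* concurrent run */
    pthread_create(&t1, NULL, task1_local, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    MPI_Finalize();
    return 0;
}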

7.1 Discussion of Software Contention

CUDA's stream-0 (the default stream) is unique in that it is completely ordered with all operations issued on any stream of the device. That is, issuing operations on stream-0 is functionally equivalent to synchronizing the entire device before and after each operation. Although MPI-ACC privately creates and uses custom streams to minimize software contention with other streams, a concurrent user operation on stream-0 can inadvertently stall any MPI-ACC operation on that GPU until stream-0 has completed. OpenCL, on the other hand, does not have such special queues and does not incur this software contention.
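The sketch below illustrates the stream-0 pitfall: a kernel launched on the (legacy) default stream is ordered with work on every other stream of the device, including a GPU-integrated MPI library's internal pipeline streams, whereas a kernel on a private stream is not. The kernel name is an illustrative placeholder.

/* Sketch of launching a kernel on stream-0 versus a custom stream. */
#include <cuda_runtime.h>

__global__ void preprocess(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;                 /* placeholder transformation */
}

void launch_preprocessing(float *d_buf, int n, int use_default_stream)
{
    if (use_default_stream) {
        /* Implicitly stream-0: fully ordered with all other streams,
         * so it can stall concurrent pipelined MPI transfers. */
        preprocess<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
    } else {
        /* A private stream lets the kernel run concurrently with the
         * library's internal streams. */
        cudaStream_t s;
        cudaStreamCreate(&s);
        preprocess<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
}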

Contention due to stream-0 can be seen even in GPU-EpiSimdemics, and we analyze its effect as follows.


In GPU-EpiSimdemics, the internode CPU-GPU communication of the visit messages is overlapped with a preprocessing kernel that performs data layout transformation (Section 5.1). While we use non-0 streams within MPI-ACC for the internode communication of visit messages, the preprocessing kernel may be launched on the user's chosen CUDA stream. Figure 12 shows that the performance of GPU-EpiSimdemics is about 6.6% slower when the preprocessing kernels use stream-0 instead of a non-0 stream, and the slowdown can be up to 16.3% for some node configurations. While MPI-ACC's streams are designed to scale, a poor application design using stream-0 can cause an apparent slowdown in MPI-ACC's data transfer performance.

7.2 Minimizing the Hardware Contention

MPI-ACC uses the D2H and H2D hardware queues of the GPU for send and receive, respectively. In theory, MPI-ACC communication can overlap with kernel invocations or with data transfer operations in the opposite direction, that is, those using the other data transfer queue. However, MPI-ACC can cause contention with another data transfer operation in the same direction; for example, MPI_Send can contend with a concurrent D2H data transfer. MPI-ACC operations can also potentially contend for the on-device memory controller; for example, MPI_Send or MPI_Recv can slow a global-memory-intensive kernel that is accessing the same memory module. In this section, we quantify and evaluate the global memory and PCIe contention effects.

Global memory contention analysis: We study the impact of global memory contention by executing MPI-ACC operations in task-0 and global memory read/write benchmarks in task-1 with custom CUDA streams. Our experiments indicate that the performance drop due to contention in the MPI-ACC communication is about 4%, whereas the global memory kernels slow by about 8%. The average MPI-ACC call runs longer than an average global memory access, so MPI-ACC has a smaller relative performance reduction. On the other hand, the performance impact of MPI-ACC on on-chip (local) memory accesses and simple computational kernels using custom CUDA streams is smaller: the performance degradation in the MPI-ACC communication is about 3%, and the GPU workloads do not have any noticeable slowdown. Because of space constraints, we omit explicit performance graphs.

PCIe contention analysis with data transfers in the opposite direction: We study the impact of PCIe contention by having task-0 perform MPI_Send or MPI_Recv communication operations with GPU-0, while task-1 executes H2D or D2H calls. This approach gives four different task combinations, of which two combinations perform bidirectional data transfers and two combinations transfer data in the same direction. In this paper, we report the results of running MPI_Send (task-0) concurrently with H2D and D2H transfers on the same GPU (task-1). The contention analysis of MPI_Recv with H2D and D2H transfers is identical, and we exclude it because of space constraints. If task-0 and task-1 perform bidirectional data transfers and use custom CUDA streams, we find that the average slowdown of task-0 is 6% and the H2D task (task-1) has a negligible slowdown. Ideally, if the bidirectional bandwidth were twice the unidirectional bandwidth, then neither of the concurrent tasks would have any slowdown. In our experimental platform, however, the bidirectional bandwidth is only about 19.3% more than the unidirectional bandwidth according to the simpleMultiCopy CUDA SDK benchmark. Thus, task-0's slowdown is due to the slower bidirectional bandwidth and not due to any MPI-ACC–related contention effects.

PCIe contention analysis with data transfers in the same direction: For this study, we analyze contention effects when MPI_Send (task-0) is invoked concurrently with standalone D2H transfers on the same GPU (task-1). We analyze the contention impacts of three MPI_Send implementations: MPI-ACC, MVAPICH, and manual pipelining using asynchronous MPI and CUDA. Since the Fermi GPUs have a single data transfer hardware queue in each direction, a task that oversubscribes the GPU can significantly slow any other task that uses the same queue. In fact, we show that MPI-ACC induces less contention than MVAPICH and the manual asynchronous MPI+GPU approaches to GPU data communication. We show that MPI-ACC enqueues commands to the GPU hardware queue in a balanced manner, thereby minimizing the apparent performance slowdown in the D2H task (task-1) while incurring only a modest slowdown in the MPI communication (task-0) itself.

Figure 13a shows the relative increase in the MPI latency due to contention from the D2H task. For this experiment, the D2H task consistently transfers 16 MB between the device and the host, whereas the data size for the MPI task is varied. MPI-ACC shows a maximum slowdown of about 5% for relatively small MPI data transfers, and the slowdown for larger MPI data transfers is negligible on average. The other implementations demonstrate little to negligible slowdown for all data sizes.

Figure 13b shows the normalized bandwidth of the D2H task when run concurrently with MPI. For this experiment, the MPI task consistently performs 16 MB GPU-GPU data transfers across the network, whereas the data size for the local D2H task is varied. We see that MPI-ACC causes a maximum performance slowdown of 3.8% to the D2H task for relatively small data, and the performance slowdown for larger D2H data sizes is negligible. However, MVAPICH and the manual asynchronous MPI+GPU implementation cause a slowdown of about 18% for smaller D2H data transfers. This result indicates that MPI-ACC enqueues GPU tasks to the hardware queues in a more balanced manner, whereas MVAPICH may oversubscribe the hardware queues, thereby causing significant performance variations in the other GPU tasks.

HPCToolkit analysis: HPCTOOLKIT [29], [30] is a sampling-based performance analysis toolkit capable of quantifying scalability bottlenecks in parallel programs. We use HPCTOOLKIT's hpctraceviewer interface to understand why MPI-ACC causes less contention than the manual MPI+GPU implementations do.


Fig. 13: Impact of contention due to concurrent MPI_Send and local D2H GPU operations. MPI-ACC is evaluated against MVAPICH and the manual MPI+GPU implementations. (a) Impact on the MPI_Send communication latency; task-1 (D2H) transfers 16 MB for all cases. (b) Impact on the local D2H GPU operations; task-0 (MPI) transfers 16 MB for all cases. MPI-ACC causes the least contention for the D2H task.

hpctraceviewer renders hierarchical, timeline-based visualizations of parallel hybrid CPU-GPU programs. Figure 14 presents screenshots of the detailed execution profile of our contention benchmark. The hpctraceviewer tool presents the timeline information of all CPU processes, threads, and their corresponding CUDA streams. However, we zoom in and show only the timelines of the relevant CUDA streams associated with both tasks of the 0th process. The other process exhibits identical behavior and is excluded from the figure.

Figure 14a shows the effect of MPI-ACC's send operation interacting with the D2H data transfers of task-1. Since both tasks issue D2H commands and there is only one D2H queue on Fermi, only one of the CUDA streams is active at any given point in time. Moreover, MPI-ACC's pipelining logic is designed to issue GPU commands only when the next pipeline stage is ready. This design does not oversubscribe the GPU and leads to balanced execution, which can be seen as the interleaved bars in the MPI-related timeline. Figure 14b depicts the contention effect of the manual pipelined MPI+GPU implementation. In this example implementation, we enqueue all the pipeline stages upfront, which is an acceptable design for standalone point-to-point communication. This design oversubscribes the GPU, however, which can be seen as clusters of bars in the MPI-related timeline. Of course, if one designs the manual MPI+GPU implementation similarly to our MPI-ACC design, then the associated timeline figure would look like Figure 14a. Since the manual MPI+GPU implementation is more aggressive in enqueueing GPU operations, the D2H operations of task-1 tend to wait longer. That is why, on average, MPI-ACC causes the least performance perturbation to the D2H task (Figure 13b).
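The following sketch contrasts the two enqueueing policies discussed above in simplified form. The "aggressive" variant issues every pipeline chunk's D2H copy upfront, filling the single D2H hardware queue, while the "balanced" variant issues the next chunk only after the previous one drains, leaving room for other tasks' D2H requests to interleave. The structure and names are illustrative assumptions, not the actual MPI-ACC or benchmark code.

/* Sketch contrasting aggressive and balanced D2H enqueueing. */
#include <cuda_runtime.h>
#include <stddef.h>

void enqueue_aggressive(const char *d_buf, char **h_stage,
                        size_t nchunks, size_t pkt, cudaStream_t s)
{
    for (size_t i = 0; i < nchunks; i++)      /* floods the D2H queue */
        cudaMemcpyAsync(h_stage[i], d_buf + i * pkt, pkt,
                        cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
}

void enqueue_balanced(const char *d_buf, char **h_stage,
                      size_t nchunks, size_t pkt, cudaStream_t s)
{
    for (size_t i = 0; i < nchunks; i++) {
        cudaMemcpyAsync(h_stage[i], d_buf + i * pkt, pkt,
                        cudaMemcpyDeviceToHost, s);
        /* Issue the next stage only when this one is done, so concurrent
         * D2H requests from other tasks can interleave fairly. */
        cudaStreamSynchronize(s);
    }
}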

7.3 Summary

In this section, we provided insights into the scalable design of MPI-ACC and compared its performance with MVAPICH and manual MPI+GPU implementations.

Fig. 14: Using HPCTOOLKIT to understand the contention impacts of MPI-ACC and local GPU data transfer operations. (a) Impact of MPI-ACC's MPI_Send with concurrent D2H operations (balanced GPU requests). (b) Impact of the manual MPI+GPU send task with concurrent D2H operations (aggressive GPU requests).

Specifically, we showed that MPI-ACC delivers maximum concurrency by carefully ordering multiple GPU streams and efficiently balancing the H2D and D2H hardware queues for data pipelining, without oversubscribing the GPU resources.

8 CONCLUSION

In this paper, we introduced MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-connected systems. We discussed MPI-ACC's API design choices and a comprehensive set of optimizations, including data pipelining and buffer management. We studied the efficacy of MPI-ACC for scientific applications from the domains of epidemiology (GPU-EpiSimdemics) and seismology (FDM-Seismology), and we presented the lessons learned and the tradeoffs. We found that MPI-ACC's internal pipeline optimization not only helps improve the end-to-end communication performance but also enables novel optimization opportunities at the application level, which significantly enhance CPU-GPU and network utilization. With MPI-ACC, one can naturally express the communication target without explicitly treating the CPUs as communication relays. MPI-ACC decouples the application logic from the low-level GPU communication optimizations, thereby significantly improving scalability and application portability across multiple GPU platforms and generations.

ACKNOWLEDGMENT

This work was supported in part by the DOE grant DE-SC0001770; by contracts DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC06-76RL01830; by the DOE Office of Science under cooperative agreement number DE-FC02-07ER25800; by the DOE GTO via grant EE0002758 from Fugro Consultants; by the DTRA CNIMS contract HDTRA1-11-D-0016-0001; by the NSF PetaApps grant OCI-0904844; by the NSF Blue Waters grant OCI-0832603; by the NSF grants CNS-0960081, CNS-054630, and CNS-0958311; and by an NVIDIA Graduate Fellowship, NVIDIA Professor Partnership, and CUDA Research Center at Virginia Tech. This research used the HokieSpeed heterogeneous computing resource at Virginia Tech, which is supported by the National Science Foundation under contract CNS-0960081. We thank Darius Buntinas and Hao Wang for their valuable technical discussions and feedback.


REFERENCES

[1] L. Weiguo, B. Schmidt, G. Voss, and W. Muller-Wittig, "Streaming Algorithms for Biological Sequence Alignment on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 9, pp. 1270–1281, Sept. 2007.

[2] A. Nere, A. Hashmi, and M. Lipasti, "Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2011.

[3] W.-c. Feng, Y. Cao, D. Patnaik, and N. Ramakrishnan, "Temporal Data Mining for Neuroscience," GPU Computing Gems, Emerald Edition, February 2011.

[4] "TOP500 Supercomputer Sites," http://www.top500.org.

[5] MPI: A Message-Passing Interface Standard Version 2.2. Message Passing Interface Forum, 2009.

[6] "NVIDIA CUDA C Programming Guide," 2014, http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.

[7] L. Howes and A. Munshi, The OpenCL Specification. Khronos OpenCL Working Group, 2014, https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf.

[8] "NVIDIA GPUDirect," https://developer.nvidia.com/gpudirect.

[9] C. L. Barrett, K. R. Bisset, S. G. Eubank, X. Feng, and M. V. Marathe, "EpiSimdemics: An Efficient Algorithm for Simulating the Spread of Infectious Disease over Large Realistic Social Networks," in Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008.

[10] K. R. Bisset, A. M. Aji, M. V. Marathe, and W.-c. Feng, "High-Performance Biocomputing for Simulating the Spread of Contagion over Large Contact Networks," BMC Genomics, vol. 13, no. S2, April 2012.

[11] S. Ma and P. Liu, "Modeling of the Perfectly Matched Layer Absorbing Boundaries and Intrinsic Attenuation in Explicit Finite Element Methods," Bulletin of the Seismological Society of America, vol. 96, no. 5, pp. 1779–1794, 2006.

[12] W. M. Brown, P. Wang, S. J. Plimpton, and A. N. Tharrington, "Implementing Molecular Dynamics on Hybrid High Performance Computers - Short Range Forces," Computer Physics Communications, vol. 182, no. 4, pp. 898–911, 2011.

[13] T. Endo, A. Nukada, S. Matsuoka, and N. Maruyama, "Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2010.

[14] A. M. Aji, J. Dinan, D. Buntinas, P. Balaji, W.-c. Feng, K. R. Bisset, and R. Thakur, "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems," in 14th IEEE Intl. Conference on High Performance Computing and Communications (HPCC), 2012.

[15] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 10, pp. 2595–2605, 2014.

[16] "MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE," http://mvapich.cse.ohio-state.edu/.

[17] "MPICH: High-Performance Portable MPI," http://www.mpich.org.

[18] O. S. Lawlor, "Message Passing for GPGPU Clusters: cudaMPI," in IEEE Intl. Conference on Cluster Computing and Workshops (CLUSTER), 2009.

[19] J. Chen, W. Watson, III, and W. Mao, "GMH: A Message Passing Toolkit for GPU Clusters," in IEEE 16th Intl. Conference on Parallel and Distributed Systems (ICPADS), 2010.

[20] S. Yamagiwa and L. Sousa, "CaravelaMPI: Message Passing Interface for Parallel GPU-Based Applications," in 8th Intl. Symposium on Parallel and Distributed Computing (ISPDC), 2009.

[21] J. Stuart and J. Owens, "Message Passing on Data-Parallel Architectures," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2009.

[22] S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and D. Panda, "Extending OpenSHMEM for GPU Computing," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2013.

[23] A. Sidelnik, B. L. Chamberlain, M. J. Garzaran, and D. Padua, "Using the High Productivity Language Chapel to Target GPGPU Architectures," UIUC Dept. of Comp. Sci. Technical Report, 2011.

[24] J. Hammond, S. Ghosh, and B. Chapman, "Implementing OpenSHMEM Using MPI-3 One-Sided Communication," in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, 2014.

[25] J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-sided Communication," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2012.

[26] A. M. Aji, P. Balaji, J. Dinan, W.-c. Feng, and R. Thakur, "Synchronization and Ordering Semantics in Hybrid MPI GPU Programming," in 3rd Intl. Workshop on Accelerators and Hybrid Exascale Systems (IPDPSW: AsHES). IEEE, 2013.

[27] A. M. Aji, L. S. Panwar, F. Ji, M. Chabbi, K. Murthy, P. Balaji, K. R. Bisset, J. Dinan, W.-c. Feng, J. Mellor-Crummey, X. Ma, and R. Thakur, "On the Efficacy of GPU-Integrated MPI for Scientific Applications," in ACM Intl. Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2013.

[28] J. Berenger, "A Perfectly Matched Layer for the Absorption of Electromagnetic Waves," Journal of Computational Physics, vol. 114, no. 2, pp. 185–200, 1994.

[29] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs," Concurrency and Computation: Practice and Experience, vol. 22, pp. 685–701, April 2010.

[30] M. Chabbi, K. Murthy, M. Fagan, and J. Mellor-Crummey, "Effective Sampling-driven Performance Tools for GPU-accelerated Supercomputers," in Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013.

Ashwin M. Aji is a Ph.D. candidate in the Dept. of Comp. Science, Virginia Tech, where he also received an M.S. degree.

Lokendra S. Panwar is an M.S. student in the Dept. of Comp. Science, Virginia Tech.

Feng Ji is a Ph.D. candidate in the Dept. of Comp. Science, North Carolina State University.

Karthik Murthy is a Ph.D. candidate in the Dept. of Comp. Science, Rice University. He received an M.A. in computer science from the University of Texas, Austin.

Milind Chabbi is a Ph.D. candidate in the Dept. of Comp. Science, Rice University. He received an M.S. in computer science from the University of Arizona, Tucson.

Pavan Balaji is a computer scientist at Argonne National Laboratory, a Fellow of the Northwestern-Argonne Inst. of Science and Engineering at Northwestern University, and a Research Fellow of the Computation Inst. at the Univ. of Chicago.

Keith R. Bisset is a simulation and system software development scientist at the Virginia Bioinformatics Institute, Virginia Tech.

James Dinan received his Ph.D. from the Ohio State University and completed a postdoctoral fellowship at Argonne National Laboratory.

Wu-chun Feng is a professor and Elizabeth & James E. Turner Fellow in the Dept. of Comp. Science, Dept. of Electrical and Computer Engineering, and Virginia Bioinformatics Institute at Virginia Tech.

John Mellor-Crummey is a professor in the Dept. of Comp. Science and the Dept. of Electrical and Computer Engineering, Rice University.

Xiaosong Ma is a senior scientist at Qatar Computing Research Institute and an associate professor in the Dept. of Comp. Science, North Carolina State University.

Rajeev Thakur is the Deputy Director of the Math. and Comp. Sci. Division at Argonne National Laboratory, a Senior Fellow in the Computation Institute at the Univ. of Chicago, and an adjunct professor in the Dept. of Elect. Eng. and Comp. Sci. at Northwestern University.