
Research Article
A New Parallel Method for Binary Black Hole Simulations

Quan Yang,1 Zhihui Du,1 Zhoujian Cao,2 Jian Tao,3 and David A. Bader4

1 Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
3 Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70803, USA
4 School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

Correspondence should be addressed to Zhihui Du; [email protected]

Received 1 January 2016; Revised 25 May 2016; Accepted 6 June 2016

Academic Editor: Bormin Huang

Copyright © 2016 Quan Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Simulating binary black hole (BBH) systems is a computationally intensive problem, and it can lead to great scientific discovery. How to explore more parallelism to take advantage of the large number of computing resources of modern supercomputers is the key to achieving high performance for BBH simulations. In this paper, we propose a scalable MPM (Mesh based Parallel Method) which can explore both the inter- and intramesh level parallelism to improve the performance of BBH simulation. At the same time, we also leverage GPU to accelerate the performance. Different kinds of performance tests are conducted on Blue Waters. Compared with the existing method, our MPM can improve the performance from 5x speedup (compared with the normalized speed of 32 MPI processes) to 8x speedup. For the GPU accelerated version, our MPM can improve the performance from 12x speedup to 28x speedup. Experimental results also show that when only enough CPU computing resources or limited GPU computing resources are available, our MPM can employ two special scheduling mechanisms to achieve better performance. Furthermore, our scalable GPU acceleration MPM can achieve almost ideal weak scaling up to 2048 GPU computing nodes, which enables our software to handle even larger BBH simulations efficiently.

1. Introduction

The latest supercomputers [1] have a significantly increasing number of computing nodes/cores, but many practical applications cannot achieve better performance on more computing resources because enough parallelism in the applications has not been explored. At the same time, many applications cannot take advantage of supercomputers equipped with accelerators such as GPUs [2, 3] because the existing codes cannot run on GPU directly. We focus on these two problems and propose an efficient method to improve the performance of one kind of challenging scientific application (binary black hole simulations) on one typical large scale supercomputer, Blue Waters.

A binary black hole (BBH) system has two black holes in close orbit around each other. For an inspiralling BBH, the two black holes will move around each other. The number of orbits means the number of circles the black holes move; the mass ratio means the ratio of the mass of the small black hole to the mass of the big black hole. BBH systems are important because they would be the strongest known gravitational wave [4] source in the universe. The most challenging and important subject in numerical relativity now is the simulation of these BBH systems. For gravitational wave detection, a theoretical model for gravitational wave sources is essential for experimental data analysis. As an example, the theoretical model for BBH played an important role in the GW150914 detection [5]. For ground based detectors of gravitational waves, such as LIGO, VIRGO, GEO600, and KAGRA [6], the almost equal mass BBHs are the most important gravitational wave sources. For the space-based detectors, such as eLISA [7], the gravitational wave sources will include many large mass ratio BBH systems. When the mass ratio of a BBH increases, the computational cost increases dramatically, roughly proportional to the fourth power of the mass ratio. Currently, the upper limit for the mass ratio of BBH which can be simulated by numerical relativity is 100. So, for the future space-based detection of gravitational waves, improving the computational ability is quite important.

Adaptive mesh refinement (AMR) [8] is widely used in BBH simulation because of its simplicity and great reduction in total computing work. It is a natural method to divide each mesh into many submeshes and execute the submeshes of the same level in parallel to improve the simulation speed. We call it the Submesh based Parallel Method (SPM) in this paper. Halo zones are needed for each submesh to keep the data from its neighbors. SPM can achieve very good parallel performance when it only uses limited computing resources. But if the time saved by executing the smaller submeshes in parallel cannot compensate for the communication overhead to fill the halo zones, SPM cannot scale to more computing resources to achieve better performance. The size of submeshes cannot be too small because of the increasing communication overhead, and the size of the refined mesh cannot be very large because of the significant increase in total computation, so the number of parallel submeshes (MeshSize/SubmeshSize) cannot be too large. The latest supercomputers have more and more computing nodes/cores, but SPM cannot take advantage of more computing resources to further improve its performance. When we try to handle some challenging BBH simulations with more orbits (>=20) and larger mass ratio (>=400), more parallelism must be explored to significantly improve the simulation performance to conduct the simulations in reasonable time.
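To give a rough feel for why very small submeshes hurt, the following sketch (not part of AMSS-NCKU) compares the volume a cubic submesh owns with the halo volume that must be filled from its neighbors each exchange; the ghost width of 3 points is only an assumption made for illustration, chosen to cover 4th-order stencils, and is not a value stated in this paper.

#include <cstdio>

// Rough computation-to-communication model for a cubic submesh of side n
// with a halo (ghost) zone of width g on every face: the owned volume is n^3,
// while the halo volume to be filled from neighbors is (n + 2g)^3 - n^3.
// g = 3 is an illustrative assumption, not the value used by our code.
int main() {
    const long g = 3;
    for (long n : {8, 16, 32, 64}) {
        long owned = n * n * n;
        long halo  = (n + 2 * g) * (n + 2 * g) * (n + 2 * g) - owned;
        std::printf("n=%3ld  owned=%8ld  halo=%8ld  owned/halo=%.2f\n",
                    n, owned, halo, static_cast<double>(owned) / halo);
    }
    return 0;
}

Under this simple model the owned-to-halo ratio drops quickly as the submesh shrinks, which is the scalability bottleneck of SPM described above.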

We propose a novel Mesh based Parallel Method (MPM) to explore both the inter- and intramesh parallelism in BBH simulations. MPM will explore the parallelism among different mesh levels first (intermesh parallelism). Then, it employs SPM to explore the parallelism in each mesh (intramesh parallelism). MPM has two advantages. On one hand, for a given submesh size, MPM can provide much more parallelism than SPM does. On the other hand, for a given number of parallel tasks, MPM can assign more data and computation to each parallel task than SPM does. In other words, the computation-to-communication ratio of MPM can be larger than that of SPM. So MPM has higher parallel efficiency than SPM does.

We develop a new mesh partitioning algorithm to optimize the load and communication among all the parallel tasks. Our mesh partitioning algorithm is employed in our MPM to achieve balanced load and reduce communication as much as possible. In order to take advantage of the GPU capability to improve the performance, we rewrite the most computationally intensive solver codes in BBH simulations on GPU and employ different kinds of GPU code optimization methods (employing coalesced memory access and shared memory, reducing data copy between CPU and GPU, and removing unnecessary synchronization) to improve its performance.

Different kinds of experiments on performance evaluation have been conducted on the large scale Blue Waters supercomputer at the National Center for Supercomputing Applications (NCSA). The experimental results show that significant performance improvement can be achieved with our scalable MPM. Because the sequential code is too slow, we select the performance of 32 MPI processes with the SPM algorithm as the baseline. The existing method can achieve at most 5x speedup; our MPM can achieve 8x speedup. After we port and optimize the code on GPU, the existing method can achieve at most 12x speedup; our scalable MPM can achieve about 28x speedup. Furthermore, our scalable MPM + GPU acceleration method can achieve almost ideal weak scaling up to 2048 GPU computing nodes. This means that our method can handle very large problems on a large number of computing resources efficiently.

The major contribution of our work lies in two aspects: First, the proposed MPM enables the BBH simulation to take advantage of more computing resources of supercomputers to achieve better performance. Second, the proposed MPM has been enabled by GPU to further improve the performance of BBH simulation.

2. Problem Description

We will briefly introduce the BBH simulation problem and the numerical method used to solve the problem first. Then, a special mesh refinement method used in BBH simulation is given.

2.1. Equations for the Problem. In order to simulate BBH systems in general relativity, the basic idea is to solve Einstein's equations [9]. We adopt the BSSN (Baumgarte-Shapiro-Shibata-Nakamura) [10] formalism, which is widely accepted by the numerical relativity community. The BSSN formalism is a conformal-traceless "3 + 1" formulation of the Einstein equations. In this formalism, the space-time is decomposed into three-dimensional spacelike slices, described by the three-metric $\gamma_{ij}$; its embedding in the four-dimensional space-time is specified by the extrinsic curvature $K_{ij}$ and the variables lapse $\alpha$ and shift vector $\beta^i$, which specify a coordinate system. $G$ is the gravitational constant, and $c$ is the speed of light. In numerical relativity, geometrical units are usually adopted, which lead to $G = c = 1$. In this paper, we follow the notations of [11]. The related equations for the problem are as follows:

\[
\begin{aligned}
\partial_t\phi ={}& \beta^i\phi_{,i} - \tfrac{1}{6}\alpha K + \tfrac{1}{6}\beta^i_{,i},\\
\partial_t\tilde\gamma_{ij} ={}& \beta^k\tilde\gamma_{ij,k} - 2\alpha\tilde A_{ij} + 2\tilde\gamma_{k(i}\beta^k_{,j)} - \tfrac{2}{3}\tilde\gamma_{ij}\beta^k_{,k},\\
\partial_t K ={}& \beta^i K_{,i} - D^2\alpha + \alpha\bigl[\tilde A_{ij}\tilde A^{ij} + \tfrac{1}{3}K^2\bigr],\\
\partial_t\tilde A_{ij} ={}& \beta^k\tilde A_{ij,k} + e^{-4\phi}\bigl[\alpha R_{ij} - D_iD_j\alpha\bigr]^{\mathrm{TF}} + \alpha\bigl(K\tilde A_{ij} - 2\tilde A_{ik}\tilde A^{k}{}_{j}\bigr)\\
& + 2\tilde A_{k(i}\beta^k_{,j)} - \tfrac{2}{3}\tilde A_{ij}\beta^k_{,k},\\
\partial_t\tilde\Gamma^i ={}& \beta^j\tilde\Gamma^i_{,j} - 2\tilde A^{ij}\alpha_{,j} + 2\alpha\Bigl(\tilde\Gamma^i_{jk}\tilde A^{kj} - \tfrac{2}{3}\tilde\gamma^{ij}K_{,j} + 6\tilde A^{ij}\phi_{,j}\Bigr)\\
& - \tilde\Gamma^j\beta^i_{,j} + \tfrac{2}{3}\tilde\Gamma^i\beta^j_{,j} + \tfrac{1}{3}\tilde\gamma^{ki}\beta^j_{,jk} + \tilde\gamma^{kj}\beta^i_{,kj}.
\end{aligned}
\qquad (1)
\]


Figure 1: An example to show how different mesh levels (Mesh 0, Mesh 1, Mesh 2) are created to cover the black hole and how the different mesh levels are evolved with different time steps (T, T/2, T/4). Synchronization is necessary for neighbor mesh levels to exchange data.

Here, $\rho$, $s$, $s_i$, and $s_{ij}$ are source terms which come from matter. For a vacuum space-time, $\rho = s = s_i = s_{ij} = 0$. In the above evolution equations, $D_i$ is the covariant derivative associated with the three-metric $\gamma_{ij}$, and "TF" indicates the trace-free part of tensor objects. For the time evolution, we use the 4th-order Runge-Kutta (for short "RK" in the following parts) method. Spatial derivatives are computed using 4th-order finite differencing. The advection terms are approximated with an upwind finite differencing stencil, and other derivative terms are approximated with a centered stencil. More details regarding the equations and implementations can be found in [12, 13].
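As an illustration of the discretization just described (and not the actual AMSS-NCKU routines), the following sketch shows a 4th-order centered first derivative and a 4th-order upwind-biased derivative of the kind commonly used for advection terms in BSSN codes; the exact lopsided stencil used in our solver is not spelled out above, so the coefficients below should be read as a representative, assumed choice.

#include <cmath>
#include <cstdio>
#include <vector>

// 4th-order centered first derivative on a uniform grid with spacing h.
double centered4(const std::vector<double>& f, int i, double h) {
    return (f[i - 2] - 8.0 * f[i - 1] + 8.0 * f[i + 1] - f[i + 2]) / (12.0 * h);
}

// 4th-order upwind-biased first derivative; the stencil is shifted toward
// the direction of the advection speed beta (coefficients are a common
// choice in moving-puncture codes, assumed here for illustration).
double upwind4(const std::vector<double>& f, int i, double h, double beta) {
    if (beta > 0.0)
        return (-3.0 * f[i - 1] - 10.0 * f[i] + 18.0 * f[i + 1]
                - 6.0 * f[i + 2] + f[i + 3]) / (12.0 * h);
    return (3.0 * f[i + 1] + 10.0 * f[i] - 18.0 * f[i - 1]
            + 6.0 * f[i - 2] - f[i - 3]) / (12.0 * h);
}

int main() {
    // Quick check on f(x) = sin(x); both stencils converge at 4th order.
    const double pi = std::acos(-1.0);
    const int n = 64;
    const double h = 2.0 * pi / n;
    std::vector<double> f(n + 1);
    for (int i = 0; i <= n; ++i) f[i] = std::sin(i * h);
    const int i = n / 2;
    std::printf("centered: %.6e  upwind: %.6e  exact: %.6e\n",
                centered4(f, i, h), upwind4(f, i, h, 1.0), std::cos(i * h));
    return 0;
}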

2.2. Mesh Refinement Method. Mesh refinement is a technique which allows a simulation to spend more time on the parts of the domain which are more interesting. The Berger-Oliger mesh refinement (or AMR) algorithm [8] is widely used in BBH simulations. We use Fixed Mesh Refinement (FMR) in this work instead of AMR because of the following two facts: (1) FMR will not introduce more computation than AMR in BBH simulations with the same accuracy; (2) FMR makes it easy to achieve load balancing, which is very critical to improving the performance in large scale parallel computing. The subcycling technology is also introduced because it can further reduce the total simulation steps significantly. For our FMR method with subcycling, the closer to the black hole, the finer the mesh, the smaller the time step (typically halved), and the more the evolution steps, but all the meshes at different levels will have the same number of grid points.

Figure 1 is an example of our FMR method with subcycling. For simplicity, we only show three mesh levels. They are the coarsest Mesh 0, the refined Mesh 1, and the finest Mesh 2. The finer mesh level will be closer to the black hole, but the coarser mesh level will cover a larger region. All the mesh levels will have the same number of grid points. When Mesh 0 executes one simulation step with time step T, Mesh 1 will need to execute two simulation steps to catch up with Mesh 0 because its time step (T/2) is half of Mesh 0's, and so on for the following finer mesh levels. After each simulation step, the mesh should synchronize with its neighbor mesh levels to exchange the boundary data or get the updated data covered by the refined mesh level.

Our FMR method with subcycling can achieve enough accuracy with significantly reduced total simulation steps because it only simulates the BBH evolution procedure with higher space and time resolution (more space points and a smaller time step) when it is closer to the black hole area. In the following sections, we will introduce how we can accelerate its simulation performance on large scale supercomputers.

3. A Scalable Parallel Method

We first explain the pros and cons of the existing SPM parallel method. Then, the basic idea and the algorithm design of our new parallel method are given to show how it can overcome the weaknesses of SPM by exploring more parallelism to improve the performance of BBH simulations.

3.1. Analysis of the Existing Method. The existing method divides each mesh level into submeshes and executes these submeshes of the same mesh level in parallel, but the calculation in different meshes will be executed in sequential order. This is the existing Submesh based Parallel Method (SPM).

Algorithm 1 is the pseudocode of the SPM; line (4) shows that when the total number of mesh levels is more than 1, it will call the recursive_step procedure to execute all the other mesh levels. The SPM is a very simple and natural way to simulate the BBH evolution. At the same time, it can achieve high parallel efficiency when only a small or even moderate number of parallel tasks are employed (the computation-to-communication ratio is high because the size of submeshes can be big enough for a limited number of parallel tasks).


(1) procedure SPM Evolution
(2)   Input: meshes and boundary conditions
(3)   evolve one step on current level
(4)   if (total level > 1) call recursive_step(1)
(5)   analyze the results
(6)   regrid if needed
(7)   Output: values on all the mesh grid points
(8) end procedure
(9) procedure recursive_step(level)
(10)   for (i = 0; i < 2; i++) do
(11)     evolve one step on current level
(12)     if (level < total level - 1) then
(13)       call recursive_step(level + 1)
(14)     end if
(15)     exchange data with related processes
(16)   end for
(17) end procedure

Algorithm 1: Submesh based Parallel Method.
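For readers who prefer compilable structure over pseudocode, the following is a minimal C++ sketch of Algorithm 1; evolve_one_step and exchange_halos are placeholder names standing in for the real solver and MPI communication and are assumptions of this sketch, not AMSS-NCKU routines.

#include <cstdio>

const int total_levels = 3;  // e.g., Mesh 0, 1, 2 as in Figure 1

// Placeholders for the real per-level solver and halo exchange (assumed).
void evolve_one_step(int level) { std::printf("evolve level %d\n", level); }
void exchange_halos(int level)  { std::printf("exchange on level %d\n", level); }

// Algorithm 1, recursive_step: each finer level takes two steps per parent step.
void recursive_step(int level) {
    for (int i = 0; i < 2; ++i) {
        evolve_one_step(level);
        if (level < total_levels - 1) recursive_step(level + 1);
        exchange_halos(level);
    }
}

int main() {
    // One coarse step on level 0; the recursion then drives all finer levels.
    evolve_one_step(0);
    if (total_levels > 1) recursive_step(1);
    // analysis and regridding would follow here
    return 0;
}

For three levels this recursion performs the seven RK steps RK1-RK7 of Figure 2(a) in strictly sequential order, which is exactly the behavior MPM removes.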

When larger scale supercomputers are available and we hope to further improve the BBH simulation performance with more computing resources, SPM often cannot achieve better performance on the latest supercomputers because it can only explore limited parallelism. SPM cannot achieve good parallel efficiency when the size of the submesh is very small. The halo zone is necessary for each submesh to exchange data with its neighbor submeshes, and the overhead of communication to fill the halo zone increases quickly when the size of the submesh is very small. So the number of parallel submeshes will be limited, and this is the essential bottleneck which prevents the SPM from achieving better scalability.

Figure 2(a) shows how different mesh levels in the SPM method are evolved one by one in order. The green blocks are the calculation steps of the different mesh levels. The SPM will execute the Runge-Kutta calculation steps from RK1 to RK7 in sequential order even though some of them can be parallelized.

Another weakness of SPM is that when we employ GPU to accelerate the computationally intensive part of each parallel process, the computation time will be significantly reduced but the communication time will not change. This will lead to a quick decrease in parallel efficiency. The SPM can really employ GPU to improve the performance of each single parallel process, but it prevents the application from scaling onto more computing resources as well.

So the SPM only performs well in small or moderate scale parallel computing. It cannot scale to larger scale computing resources. When GPU computing resources are employed, the absolute performance can be improved at first, but the scalability will become worse because of the lower computation-to-communication ratio. Another weakness of the SPM is that it cannot work together with GPU well. This is why we must design a new scalable parallel algorithm for the challenging BBH simulations on larger scale supercomputers.

3.2. Mesh Based Parallel Method. The basic idea of our new parallel method is exploring the high level parallelism among all the mesh levels first. This means that we will remove all the control dependence among different mesh levels introduced by the recursive execution of SPM and allow all the meshes to evolve in parallel as long as the necessary data is ready. Then, we will explore the low level parallelism in each mesh level if more parallelism is needed. We call this parallel method the Mesh based Parallel Method (MPM).

Figure 2(b) shows how the different Runge-Kutta calculations in Figure 2(a) can be executed in a parallel way. For one big physical step T, Mesh 0, Mesh 1, and Mesh 2 have to execute Runge-Kutta calculations {RK1}, {RK2, RK5}, and {RK3, RK4, RK6, RK7}, respectively, based on our FMR method. All the evolution steps of the same mesh level must be executed in sequence because of the data dependence, but the evolution steps from different mesh levels can be executed in parallel. This is very different from the SPM. So our MPM will execute the steps as follows: parallel executing calculation {RK1, RK2, RK3} → syn → parallel executing calculation {RK4} → syn → parallel executing calculation {RK5, RK6} → syn → parallel executing calculation {RK7} → syn. But SPM has to execute calculations from RK1 to RK7 one by one in sequential order as follows: parallel executing calculation {RK1} → syn → parallel executing calculation {RK2} → syn → ... → parallel executing calculation {RK7} → syn. In this way, the MPM can save about half the simulation time by executing more meshes in parallel.

Although the intermesh level parallel method has been employed alone in some packages without subcycling technology, such as in [14], the inter- and intramesh level parallel methods together have never been used in packages/applications with subcycling technology. The subcycling technology can significantly reduce the total simulation steps without losing simulation accuracy, but it makes the parallel execution among different mesh levels very difficult because the communication and synchronization features among different mesh levels are very different from the features within the same mesh level.

To solve this problem, we design a general and highly efficient mechanism to implement both inter- and intramesh level communication. In our MPI [15] code implementation, all the data needed by the same MPI process will be collected first, then packed into one buffer, and sent to the corresponding process with a nonblocking MPI communication function. At the same time, the corresponding MPI process will receive and unpack the data to get the halo zone data or prolongation/restriction data. In this way, we can combine the two communication patterns into one, reduce the total number of MPI operations, and allow MPI communication to overlap with computation.
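The following is a minimal sketch of this packed, nonblocking exchange (not the AMSS-NCKU implementation): everything destined for a given peer rank is assumed to be already packed into one buffer per peer, receive sizes are assumed to be agreed upon in advance, and unpacking into halo zones or prolongation/restriction targets is left to the caller.

#include <mpi.h>
#include <cstddef>
#include <vector>

// One combined exchange round: for each peer rank we send a single packed
// buffer (halo data plus prolongation/restriction data) with nonblocking MPI
// so that computation can overlap communication. Buffer contents, sizes, and
// the pack/unpack helpers are illustrative assumptions.
void packed_exchange(const std::vector<int>& peers,
                     const std::vector<std::vector<double>>& send_payload,
                     std::vector<std::vector<double>>& recv_payload,
                     MPI_Comm comm) {
    std::vector<MPI_Request> reqs;
    reqs.reserve(2 * peers.size());

    // Post all receives first (message sizes assumed known beforehand).
    for (std::size_t p = 0; p < peers.size(); ++p) {
        MPI_Request r;
        MPI_Irecv(recv_payload[p].data(), static_cast<int>(recv_payload[p].size()),
                  MPI_DOUBLE, peers[p], 0, comm, &r);
        reqs.push_back(r);
    }
    // Post all sends; each buffer already holds the packed data for that peer.
    for (std::size_t p = 0; p < peers.size(); ++p) {
        MPI_Request r;
        MPI_Isend(send_payload[p].data(), static_cast<int>(send_payload[p].size()),
                  MPI_DOUBLE, peers[p], 0, comm, &r);
        reqs.push_back(r);
    }

    // ... interior computation that does not depend on the incoming data
    //     could run here to overlap with the communication ...

    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
    // The caller then unpacks each recv_payload[p] into halo zones or
    // prolongation/restriction targets.
}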

Another problem is that, for large scale parallel computing, unbalanced load can significantly harm the parallel effect. Our tests also show that the mesh partition method can greatly affect the performance of BBH simulations. So we design a new three-dimensional (3D) mesh partitioning algorithm to achieve two objectives: load balancing and less communication.


Figure 2: An example to show the difference between SPM and MPM. The red line is control flow and the blue line is communication flow. The green blocks are the calculation steps with the Runge-Kutta (RK) method and the yellow circles are the communication operations between neighbor mesh levels. (a) SPM executes different mesh levels one by one; (b) MPM executes different mesh levels in parallel.

(1) procedure Mesh_Partition(Mesh, Procs, MinMesh)
(2)   Input: Mesh, Procs, MinMesh
(3)   tmp[] = 1
(4)   while (tmp[0] × tmp[1] × tmp[2] < Procs) do
(5)     select d to maximize ⌈Mesh[d]/tmp[d]⌉
(6)     if (⌈Mesh[d]/(tmp[d] + 1)⌉ < MinMesh[d]) break
(7)     tmp[d]++
(8)   end while
(9)   for (d = 0; d ≤ 2; d++) do
(10)     SubMesh[d] = ⌈Mesh[d]/tmp[d]⌉
(11)   end for
(12)   Output: SubMesh
(13) end procedure

Algorithm 2: 3D mesh partition algorithm.

Algorithm 2 is our method for dividing a three-dimensional Mesh with a given number of parallel processes Procs and a minimum submesh size MinMesh as input parameters. The partition result SubMesh will be returned as the output. The basic idea of our method is that the sizes of the submeshes should be close to each other to keep load balancing. At the same time, we try to make the shape of each submesh as close to a cube as possible to reduce the size of the halo zone (so less data will be transferred). From line (3) to line (7), we calculate the maximum number of processors which will be used to divide the given Mesh in each dimension. The total number of processors cannot be larger than the given value Procs, and the submesh size in each dimension cannot be smaller than MinMesh. Line (4) shows that we will divide the dimension with the largest submesh size first, so the shape of the submesh cannot be far away from a cube. Finally, we can get the size of the submesh which meets all requirements (from line (8) to line (10)). It is easy to check that our 3D mesh partitioning algorithm can really find submeshes whose sizes are close to each other and whose shapes are close enough to a cube.
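A direct C++ transcription of Algorithm 2 is sketched below; the per-dimension array representation, the 128 × 128 × 64 example mesh (the default configuration of Section 4.2.3), and the minimum submesh extent of 16 are choices made only for illustration.

#include <array>
#include <cstdio>

using Dim3 = std::array<long, 3>;
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

// Algorithm 2: split a 3D mesh over at most 'procs' processes, never letting
// a submesh fall below 'min_mesh' in any dimension, and always cutting the
// dimension whose current submesh extent is largest (to stay close to a cube).
Dim3 mesh_partition(const Dim3& mesh, long procs, const Dim3& min_mesh) {
    Dim3 tmp = {1, 1, 1};
    while (tmp[0] * tmp[1] * tmp[2] < procs) {
        int d = 0;
        for (int k = 1; k < 3; ++k)
            if (ceil_div(mesh[k], tmp[k]) > ceil_div(mesh[d], tmp[d])) d = k;
        if (ceil_div(mesh[d], tmp[d] + 1) < min_mesh[d]) break;  // would get too small
        ++tmp[d];
    }
    Dim3 sub;
    for (int d = 0; d < 3; ++d) sub[d] = ceil_div(mesh[d], tmp[d]);
    return sub;
}

int main() {
    // Example parameters (assumed): default mesh size, 256 processes, min extent 16.
    Dim3 sub = mesh_partition({128, 128, 64}, 256, {16, 16, 16});
    std::printf("submesh: %ld x %ld x %ld\n", sub[0], sub[1], sub[2]);
    return 0;
}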

After the mesh at different levels has been partitioned, Algorithm 3 provides the SPMD (Single Program Multiple Data) pseudocode of our MPM. In line (3), we first calculate the total number of evolution steps of the current mesh level. Then, from line (4) to line (9), all the evolution steps of the current level are executed. After each evolution step, line (6) shows that the related processes have to exchange data with each other.


(1) procedure MPM_Evolution(MyLevel)
(2)   Input: meshes and boundary conditions
(3)   NSteps = 2^MyLevel
(4)   for (k = 0; k < NSteps; k++) do
(5)     evolve one step on current level
(6)     exchange data with related processes
(7)     analyze the results
(8)     regrid if needed
(9)   end for
(10)   Output: values on all the mesh grid points
(11) end procedure

Algorithm 3: Mesh based Parallel Method.

The exchange includes two kinds of communication patterns: (1) updating the halo zone area; (2) prolongation/restriction with the neighbor mesh level. We analyze the result after each simulation step (line (7)). If the black hole changed its position, we need to regenerate the mesh (line (8)).
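The per-process structure of Algorithm 3 can be sketched as follows; the four routines are placeholders for the real solver, communication, analysis, and regridding code, and the loop in main merely illustrates the 2^MyLevel step counts for three levels rather than real process-to-level assignment.

#include <cstdio>

// Placeholders (assumed names) for the real per-level routines.
void evolve_one_step(int level)  { std::printf("evolve level %d\n", level); }
void exchange_data(int /*level*/)    { /* halo + prolongation/restriction */ }
void analyze_results(int /*level*/)  { /* e.g., locate punctures, extract waves */ }
void regrid_if_needed(int /*level*/) { /* re-center the mesh on the black hole */ }

// Algorithm 3: each group of MPI processes owns one mesh level 'my_level'
// and advances it 2^my_level times per coarse physical step.
void mpm_evolution(int my_level) {
    const long n_steps = 1L << my_level;   // NSteps = 2^MyLevel
    for (long k = 0; k < n_steps; ++k) {
        evolve_one_step(my_level);
        exchange_data(my_level);   // may wait on neighbor levels' data
        analyze_results(my_level);
        regrid_if_needed(my_level);
    }
}

int main() {
    // In the real code each process calls this once, for the level it owns.
    for (int level = 0; level < 3; ++level) mpm_evolution(level);
    return 0;
}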

4. Experimental Results

In this section, we first introduce the BBH simulation software which has integrated our MPM algorithm. Then, we describe the configuration of our hardware platforms and the default BBH simulation setup in our experiments. Next, we evaluate our GPU implementation to show that its result is correct and matches the CPU result well. Finally, we show how our novel MPM can improve the performance of BBH simulations with or without GPU acceleration. Two special scheduling mechanisms are given to show that our MPM algorithm can work together with them to improve the performance of BBH simulations under two practical scenarios: (1) enough CPU resources are available, but no GPU resources are available; (2) only limited GPU resources are available.

4.1. Simulation Software. All of the methods presented in this paper have been implemented in a practical BBH simulation software, AMSS-NCKU [12, 16]. Figure 3 shows the structure of the AMSS-NCKU software system. It is an MPI program and its function can be divided into two parts, controller and solver. The control part is written in C++ and the solver is in CUDA. A multicore cluster node may have one or multiple CPUs/GPUs. The user can decide how many MPI processes are allocated to each cluster node. Multiple MPI processes in one node can share one GPU (we export CRAY_CUDA_MPS = 1 to enable multiple MPI processes on a single node to share one GPU on Blue Waters) or run exclusively on multiple GPUs. We redesign the mesh partitioning algorithm, the simulation control flow, and the MPI communication mechanism to implement our MPM. At the same time, the new mesh partitioning algorithm can achieve load balancing and lower communication cost. Nonblocking MPI communication is employed to overlap the computation and communication so that better performance can be achieved.

Figure 3: The software structure of our GPU enabled BBH simulation with the proposed MPM.

The solver of AMSS-NCKU is the most computationally intensive part, and it has hundreds of Fortran functions. We port these hundreds of Fortran functions to GPU in CUDA, and then a great effort is taken to optimize the large amount of CUDA code on GPU to achieve better performance. The total source code of AMSS-NCKU is more than 100K lines.

4.2. Experimental Setup

4.2.1. Hardware Platforms. Most of the experimental results provided in this paper are from the Blue Waters supercomputer (https://bluewaters.ncsa.illinois.edu). Blue Waters is a Cray XE6/XK7 system consisting of more than 22,500 XE6 compute nodes (each containing two AMD Interlagos 16-core processors) augmented by more than 4200 XK7 compute nodes (each containing one AMD Interlagos 16-core processor and one NVIDIA GK110 "Kepler" accelerator) in a single Gemini interconnection fabric.

At the same time, we also test our program on other large GPU clusters, such as Mole-8.5 (http://www.top500.org/system/176899), which contains 362 compute nodes, each with two Intel Xeon E5520 processors and 6 NVIDIA 2050 GPUs, and SuperMike-II (https://www.cct.lsu.edu/supermike-ii), which contains 440 compute nodes, each with two Intel Sandy Bridge Xeon E5-2670 octacore processors; fifty of its nodes are equipped with two NVIDIA Tesla M2090 GPUs. Both of them use an InfiniBand network.

4.2.2. Scheduling Methods. We employ three scheduling mechanisms to evaluate the different parallel methods. The first is Default Scheduling. For CPU jobs, it means that one MPI process is assigned to one CPU core, so that one CPU computing node with K cores will be assigned K MPI processes. For GPU jobs, it means that one MPI process is assigned to one GPU computing node, so that one computing node with multiple CPU cores will have only one MPI process. For CPU jobs, the resource contention among different MPI processes cannot be avoided, and this will lead to performance reduction using Default Scheduling. For GPU jobs, the Default Scheduling will lead to low CPU and GPU utilization because it cannot fully explore the capability of all hardware resources. So we design and implement another two special scheduling mechanisms: Exclusive Scheduling for CPU jobs and GPU-Shared Scheduling for GPU jobs (we export CRAY_CUDA_MPS = 1 to enable multiple MPI tasks on a single node to share one GPU on the Blue Waters or Titan supercomputer). Exclusive Scheduling means that one MPI process will be assigned to one independent computing node no matter how many cores it has. GPU-Shared Scheduling means that more than one MPI process will share one GPU. In practical scenarios, sometimes the system is not equipped with GPUs (or GPUs are not available temporarily) but has many CPU computing nodes; sometimes the system only has limited GPUs. We implement the Exclusive Scheduling and GPU-Shared Scheduling to do experiments under these two practical scenarios.

4.2.3. Default Application Configuration. The default configuration for all the experiments is as follows. A typical mesh size (128 × 128 × 64), which is widely used in current BBH simulations, is selected. The total number of mesh levels is 8. The execution time is the accumulated time of 5 big physical time steps.

4.3. Correctness Evaluation. Precision is very important for scientific applications. To evaluate the correctness of our GPU code, we perform the following two kinds of evaluation tests. The first is directly comparing the difference of the RHS calculation between CPU results and GPU results. Figure 4 shows how we conduct the experiments. We not only use the real data as inputs but also randomly generate hundreds of arrays as different inputs. For the same input, we execute both the original CPU codes and the corresponding GPU CUDA codes. We develop a tool which can dump all the intermediate results at different positions of both the CPU and GPU codes. In this way, we can compare the difference between the CPU result and the GPU result. For all the different inputs, the maximal difference of the intermediate results between CPU and GPU is less than 10^-14. So the GPU results are very close to the CPU results for different inputs.
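The core of such a comparison tool reduces to computing a maximum absolute difference over the dumped arrays; the sketch below shows that core only, with the file loading replaced by hard-coded arrays since the dump format is specific to our tool and is not described here.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare two dumped intermediate-result arrays (CPU path vs. GPU path for
// the same input) and report the maximum absolute difference. Section 4.3
// reports that this difference stayed below 10^-14 in our tests.
double max_abs_diff(const std::vector<double>& cpu, const std::vector<double>& gpu) {
    double m = 0.0;
    for (std::size_t i = 0; i < cpu.size() && i < gpu.size(); ++i)
        m = std::max(m, std::fabs(cpu[i] - gpu[i]));
    return m;
}

int main() {
    // In the real tool these arrays are loaded from dump files written at the
    // same checkpoint of the CPU and GPU code paths (illustrative values here).
    std::vector<double> cpu_dump = {1.0, 2.0, 3.0};
    std::vector<double> gpu_dump = {1.0, 2.0, 3.0};
    std::printf("max |cpu - gpu| = %.3e\n", max_abs_diff(cpu_dump, gpu_dump));
    return 0;
}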

The second evaluation is using a typical application for binary black hole collision calculations (see Figure 5). Here, we use two spinless identical black holes in a head-on collision along the y-axis. Initially, these two black holes are separated by 2.303M, with M being the total mass of the two black holes. This configuration has been used for code testing by many numerical relativity groups. First, we compare the dynamical variables evolved with fourth-order Runge-Kutta time integration. The difference between CPU results and GPU results is less than 10^-12 for the whole simulation process. Figure 5(a) gives an example of such a comparison for the evolved variable $\tilde{A}_{xy}$ on the z = 0 plane at time t = 140M. These plots are from SuperMike-II. The gravitational wave form is one of the most important physical quantities for numerical relativity study. In this work, we use the Newman-Penrose scalar $\Psi_4$ to represent the gravitational wave.

Figure 4: Different kinds of synthetic inputs are generated for the CPU and corresponding GPU codes. We can dump and compare the calculation results of CPU and GPU at different stages. In this way, we can analyze the numerical difference between CPU and GPU calculation for the same input.

This numerical technique is widely used in the numerical relativity community. Here, we compare the gravitational wave forms generated by CPU and GPU. The test simulation time is 120M. In Figure 5(b), we can see that the gravitational waveforms generated from GPU and CPU BBH simulations on Blue Waters, SuperMike-II, and Mole-8.5 are almost identical to each other. All of these tests confirm the correctness of our GPU implementation for the numerical relativity code.

4.4. Performance Improvement with Our Scalable MPM Algorithm. In this section, we investigate how our MPM algorithm can replace the existing SPM algorithm to achieve better performance with both CPU and GPU codes. The performance of the two special scheduling mechanisms for practical scenarios can be improved with our MPM algorithm too. Because the performance of the sequential code is unacceptable, we use the execution time of 32 MPI processes of the CPU-SPM code as the baseline (5718 s) to calculate the speedup. In this section, we show the performance curve until the best performance is achieved.

4.4.1. Employing MPM Algorithm in CPU Version. Figure 6 shows how our MPM algorithm can use more parallel processes to improve the performance of BBH simulation. The new CPU version of the MPM algorithm is marked as CPU-MPM and the existing one is marked as CPU-SPM.

Figure 6(a) shows that the performance of CPU-SPM is even better than the performance of CPU-MPM when the number of parallel MPI processes is no more than 512. The reason is that the MPM algorithm explores more parallelism at the cost of some idle computing resources. The SPM algorithm cannot scale to more MPI processes, but it will keep all the computing resources busy.


Figure 5: Correctness evaluation. (a) Dynamic variables comparison shows that the CPU results are very close to the GPU results; (b) the waveform comparison shows that the CPU results and GPU results from different platforms are very close to each other.

At first, when the number of MPI processes is small, the SPM algorithm can fully take advantage of the given parallel resources and achieve higher performance. Since the MPM algorithm leaves some computing resources idle when not all the mesh levels can be executed in parallel, the performance of the MPM algorithm is lower than SPM. However, when more than 512 MPI processes are used, the parallel efficiency of the SPM algorithm becomes lower and lower because the total computation for each MPI process becomes smaller and smaller. This is why the performance of SPM becomes worse when more MPI processes are used. Compared with the SPM algorithm, the MPM algorithm can use more MPI processes to execute different mesh levels in parallel without significantly decreasing the computation for each MPI process. So the parallel efficiency of MPM is much better than SPM when more and more MPI processes are used. This is why the performance of SPM is better than the performance of MPM at first and finally MPM can scale to a larger number of MPI processes to improve the performance of BBH simulation. The best performance of CPU-SPM with 512 MPI processes is 1096 s. The best performance of CPU-MPM with 2048 MPI processes is 707 s.

Figure 6(b) shows that CPU-SPM can achieve about 5x speedup, but CPU-MPM can achieve about 8x speedup compared with the baseline.

4.4.2. Employing MPM Algorithm in GPU Version. After we verify the effect of the MPM algorithm in the CPU codes, we port and optimize the CPU codes onto GPU, replace the SPM algorithm in our GPU codes, and get the scalable GPU acceleration version. Figure 7 shows how our MPM algorithm can further improve the performance of our GPU codes. The GPU codes with the MPM algorithm and the SPM algorithm are marked as GPU-MPM and GPU-SPM, respectively.

Figure 7(a) shows that the performance of all our optimized GPU codes is significantly better than the performance of CPU-SPM.


Figure 6: The strong scaling results of the MPM algorithm and the SPM algorithm in the CPU implementations. (a) Absolute execution time; (b) normalized speedup.

At the same time, the performance of GPU-SPM is also better than the performance of GPU-MPM when the number of parallel MPI processes is no more than 256 (the reason is similar to that in the CPU codes). The best GPU-SPM performance is 479 s with 256 MPI processes. Beyond 256 MPI processes, GPU-SPM cannot use more MPI processes to achieve better performance, but GPU-MPM can scale up to 1024 MPI processes (1024 GPUs) to achieve its best performance (203 s). The results show that employing GPU and the scalable MPM algorithm are the two major reasons for the performance improvement.

Figure 7(b) shows that GPU-SPM can achieve about 12x speedup, but GPU-MPM can achieve about 28x speedup. The MPM algorithm achieves about 479/203 ≈ 2.36x speedup for the GPU codes but only about 1096/707 ≈ 1.55x speedup for the CPU codes (see Section 4.4.1). This result shows that our MPM algorithm works together with GPU codes even better, achieving a more significant performance improvement.

4.4.3. Employing MPM Algorithm in Two Special Scheduling Mechanisms. In this section, we show how our MPM algorithm can work together with two special scheduling mechanisms, Exclusive Scheduling and GPU-Shared Scheduling, to improve their performance.

Exclusive Scheduling. Here each XE computing node is assigned only one MPI process, and the results are shown in Figure 8. Figure 8(a) shows the execution time of Exclusive Scheduling with the existing SPM algorithm (marked as CPU-SPM-Ex) and our MPM algorithm (marked as CPU-MPM-Ex). CPU-SPM-Ex can scale up to 512 XE computing nodes to achieve its best performance of 807 s. CPU-MPM-Ex can even scale up to 2048 XE computing nodes to achieve its best performance of 510 s. Exclusive Scheduling can really achieve better performance than the corresponding Default Scheduling for both the SPM (1.36x speedup) and MPM (1.39x speedup) algorithms by avoiding resource contention.

Figure 8(b) shows that CPU-SPM-Ex and CPU-MPM-Ex can achieve about 7x and 11x speedup, respectively. The best performance of CPU-MPM-Ex with 512 MPI processes is very close to the best performance of GPU-SPM with 256 MPI processes (256 XK computing nodes). It shows that, with enough CPU computing resources, our MPM algorithm can achieve about the same performance improvement as the GPU acceleration version with the existing SPM algorithm. Considering that the major problem for large scale applications is that they cannot take advantage of more computing resources to improve their performance, MPM + Exclusive Scheduling is a feasible way to improve the simulation performance.

GPU-Shared Scheduling. When only limited GPU resources are available, we can employ GPU-Shared Scheduling to improve the performance of GPU jobs. Figure 9(a) shows the execution time of GPU-Shared Scheduling with the existing SPM algorithm (marked as GPU-SPM-S8) and the MPM algorithm (marked as GPU-MPM-S8) when one GPU (one XK computing node) is shared by 8 MPI processes. The results show that when GPU-Shared Scheduling works together with the existing SPM algorithm, the performance is even worse than the CPU codes with Default Scheduling because of its low computation-to-communication ratio. However, when GPU-Shared Scheduling works together with our MPM algorithm, GPU-MPM-S8 can execute 2048 MPI processes on 256 XK computing nodes (256 GPUs) in 476 s. Compared with the 717 s of GPU-MPM's performance with 256 MPI processes (they have the same number of XK computing nodes but a different number of MPI processes), about 717/476 ≈ 1.5x performance improvement can be achieved.


Figure 7: The strong scaling results of the MPM algorithm and the SPM algorithm in the GPU implementations. (a) Absolute execution time; (b) normalized speedup.

Figure 8: The strong scaling results of the MPM algorithm and the SPM algorithm with Exclusive Scheduling. (a) Absolute execution time; (b) normalized speedup.

Figure 9(b) shows that the GPU-MPM-S8 version can achieve about 12x speedup with 2048 MPI processes (256 XK computing nodes or 256 GPUs). This is better than the GPU-MPM version's 8x speedup with 256 MPI processes on 256 XK computing nodes. The results show that when the number of available GPUs is limited, our MPM algorithm can work together with GPU-Shared Scheduling to fully explore the capability of the powerful GPUs to improve the performance of BBH simulations.

4.5. Weak Scaling of Our MPM Algorithm. To evaluate the weak scalability of our method, we start from mesh size 320 × 320 × 160 and use 32 MPI processes to solve the problem. When we double the number of MPI processes, we also double the mesh size. Only two mesh levels are employed to reduce the total experimental time. The total simulation time is 5 big physical time steps.

Figure 10(a) shows that when the number of MPI processes is not large, the weak scaling results of SPM are good, but SPM cannot achieve good weak scaling results when more MPI processes are used.


Figure 9: The strong scaling results of the MPM algorithm and the SPM algorithm with GPU-Shared Scheduling. (a) Absolute execution time; (b) normalized speedup.

Figure 10: The weak scaling results of the MPM algorithm and the SPM algorithm. (a) SPM; (b) MPM.

Figure 10(b) shows that the execution time of CPU-MPM increases very little when the number of MPI processes increases from 32 to 2048. The GPU-MPM curve is even flat and very close to the ideal weak scaling line. The good strong scaling results of our MPM were shown in Sections 4.4.1 and 4.4.2. This section shows that MPM, especially the GPU-MPM version, can achieve almost ideal weak scaling results. So our GPU-MPM can not only significantly reduce the execution time but also solve very large BBH simulation problems with more computing resources efficiently.

5. Related Work

5.1. BBH Simulation and the Related Codes. The first direct detection of gravitational waves [5] shows that BBH simulation is essential for gravitational wave data analysis and the parameter estimation of the gravitational wave source. The matched filtering technique requires accurate gravitational waveforms, which can only be generated by BBH simulation, to find the weak gravitational wave signal. The parameter estimation based on BBH simulation can tell us detailed information about the detected gravitational wave. However, BBH simulation is very computationally intensive. Large mass ratio BBH simulation, which is necessary for future space-based detectors, is even more challenging because the computational cost is roughly proportional to the fourth power of the mass ratio. Furthermore, developing a practical BBH simulation application is also very challenging work because of its complexity.

Till now, there are only about 10 independent numerical relativity codes for binary black hole simulations [17] in the world. CCATIE, Einstein Toolkit, LazEv, Lean, MayaKranc, and UIUC are based on the Cactus platform. AMSS-NCKU, BAM, Hahndol, PU, SACRA, SpEC, and GRChombo [18] are based on different independent packages. These codes employ different methods to handle two key problems, numerical accuracy and computing efficiency. Since computer hardware advances very quickly, how to take advantage of large scale computer resources to significantly improve the performance of BBH simulation has become a critical problem for all BBH simulation applications. The work of Cao et al. in 2008 [12] shows that AMSS-NCKU is state of the art, so we select it as the baseline of our method.

5.2. Mesh Refinement Methods. Dubey et al. [19] give a detailed comparison of six publicly available, active adaptive mesh refinement (AMR) packages that have existed for at least half a decade: BoxLib, Cactus, Chombo, Enzo, FLASH, and Uintah. They can be very different in the refinement factor, time step ratio, language, and so forth because of the diverse requirements in specific domain areas and in general functionality. Subcycling is often used to reduce the total amount of calculation when the number of grid points of the coarse mesh level is not large. For structured AMR with subcycling, executing the submeshes of the same mesh level in parallel is a very general and widely adopted idea to improve the performance, such as the method provided by Diachin et al. [20]. However, the SPM-like parallel method cannot scale to large scale computing resources because the communication among the submeshes becomes too costly when the size of the submesh is too small. Nonsubcycling leads to an easy way to execute all the different mesh levels in parallel, as in PARAMESH [14]. However, this parallel method is too expensive if the number of grid points in the finer mesh levels is large. The two parallel methods have been developed independently, and we have not seen work that combines them, because they are used in very different scenarios. In this paper, we propose our MPM, which can explore the parallelism among the different mesh levels and among the submeshes of one mesh level in BBH simulation. The advantage of the proposed MPM lies in that it can take advantage of large scale powerful supercomputers to improve its performance.

5.3. GPU Optimization Methods. In general relativity studies, GPU has been used in solving the Teukolsky equation [21] and in post-Newtonian evolution [22]. For numerically solving the 3D Einstein equations, some groups are trying to apply GPU to numerical relativity codes [23]. Zink [24] implements a general relativistic evolution CUDA code on GPU and more than 20x speedup can be achieved. However, it can only run on one GPU with single precision, so it cannot handle large scale problems, and single precision is often unacceptable for many applications. There are also ongoing efforts from the Cactus group. There are many cases of employing GPU to improve the performance of different applications [25–29]. However, for large applications, porting CPU codes to GPU is an error-prone, hard, and tedious task. So some compilers, such as [30, 31], have been developed to transform CPU codes into GPU CUDA codes automatically. The work of Lutz et al. [32] can transform stencil computations onto multiple GPUs automatically. Yang et al. [33] optimize the memory access of existing CUDA code. Considering that thousands of lines of Fortran code had to be rewritten in CUDA, we borrow those ideas and develop some compiler-like tools to make the regular code porting from CPU to GPU efficient and automatic. This can really reduce the total work and avoid errors.

Optimization methods are very important for practical applications to achieve significant performance improvement on GPUs. There are excellent general suggestions and principles for GPU code optimization. From the view of hardware, NVIDIA's suggestions [34] are improving the hardware utilization and the throughput of both memory and instructions. For different applications, the performance bottleneck may be different. The work of Ryoo et al. [35] has had great impact on GPU optimization, and they give three basic principles based on their work on optimizing different applications on GPU: improving the percentage of floating point instructions, reducing/overlapping the global memory access latency, and relieving the pressure on global memory bandwidth. A study on GPU speedup [36] indicates that the magic number of GPU speedup often has no meaning if the CPU codes are not optimized or parallelized. Only the optimized results can show the real difference between CPU and GPU codes. This is why we compare the best performance of the CPU and GPU results.

6. Conclusion and Future Work

How to take advantage of the latest supercomputers to significantly reduce the execution time of large scale applications such as BBH simulations is a really challenging problem. In this paper, we propose a novel scalable parallel method (MPM) which can explore both the inter- and intramesh parallelism to gain more parallelism. So our MPM can take advantage of more computer resources efficiently to improve the performance of BBH simulations. The proposed MPM can improve the performance of both CPU and GPU BBH simulation codes. At the same time, it can work together with two practical scheduling scenarios to improve the real application's performance.

Our experimental results show that the existing SPM can only achieve 5x speedup for CPU codes and 12x speedup for GPU codes, but our MPM can achieve 8x speedup for CPU codes and 28x speedup for GPU codes. These results show that it is necessary to take advantage of both the hardware resources and a well designed algorithm to achieve better performance.

We are working together with the Cactus group and trying to integrate our method into their package to serve the whole community. Currently, our parallel mesh refinement method supports the maximum parallelism among all the meshes, but some computing resources will be idle during some simulation stages. In the next step, we will improve our method to support the average parallelism of the meshes to improve resource utilization. Furthermore, with fewer computing resources providing the same average parallelism, the communication overhead can obviously be reduced.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research is supported in part by the National Natural Science Foundation of China (nos. 61440057, 61272087, 61363019, and 61073008), the Beijing Natural Science Foundation (nos. 4082016 and 4122039), the Sci-Tech Interdisciplinary Innovation and Cooperation Team Program of the Chinese Academy of Sciences, the Specialized Research Fund for State Key Laboratories, and NSF Awards ACI-1265434 and ACI-1339745. Parts of the experiments were conducted on the HPC System Mole-8.5 constructed by the Institute of Process Engineering, Chinese Academy of Sciences.

References

[1] H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon, Top500 supercomputing sites, 2015.

[2] J. D. Owens, D. Luebke, N. Govindaraju et al., "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum, vol. 26, no. 1, pp. 80–113, 2007.

[3] J. Nickolls and W. J. Dally, "The GPU computing era," IEEE Micro, vol. 30, no. 2, pp. 56–69, 2010.

[4] R. Cowen, "Telescope captures view of gravitational waves," Nature, vol. 507, no. 7492, pp. 281–283, 2014.

[5] B. Abbott, R. Abbott, T. Abbott et al., "Observation of gravitational waves from a binary black hole merger," Physical Review Letters, vol. 116, no. 6, Article ID 061102, 2016.

[6] D. Blair, L. Ju, C. Zhao et al., "Gravitational wave astronomy: the current status," Science China: Physics, Mechanics and Astronomy, vol. 58, no. 12, Article ID 120402, pp. 1–41, 2015.

[7] P. Amaro-Seoane, S. Aoudia, S. Babak et al., "Low-frequency gravitational-wave science with eLISA/NGO," Classical and Quantum Gravity, vol. 29, no. 12, Article ID 124016, 2012.

[8] M. J. Berger and J. Oliger, "Adaptive mesh refinement for hyperbolic partial differential equations," Journal of Computational Physics, vol. 53, no. 3, pp. 484–512, 1984.

[9] A. Einstein, "The foundation of the general theory of relativity," Annalen der Physik, vol. 354, no. 7, pp. 769–922, 1916.

[10] M. Campanelli, C. O. Lousto, P. Marronetti, and Y. Zlochower, "Accurate evolutions of orbiting black-hole binaries without excision," Physical Review Letters, vol. 96, no. 11, Article ID 111101, 2006.

[11] T. W. Baumgarte and S. L. Shapiro, Numerical Relativity: Solving Einstein's Equations on the Computer, Cambridge University Press, 2010.

[12] Z. Cao, H.-J. Yo, and J.-P. Yu, "Reinvestigation of moving punctured black holes with a new code," Physical Review D - Particles, Fields, Gravitation and Cosmology, vol. 78, no. 12, Article ID 124011, 2008.

[13] D. Hilditch, S. Bernuzzi, M. Thierfelder, Z. Cao, W. Tichy, and B. Brugmann, "Compact binary evolutions with the Z4c formulation," Physical Review D - Particles, Fields, Gravitation and Cosmology, vol. 88, no. 8, Article ID 084057, 2013.

[14] P. MacNeice, K. M. Olson, C. Mobarry, R. de Fainchtein, and C. Packer, "PARAMESH: a parallel adaptive mesh refinement community toolkit," Computer Physics Communications, vol. 126, no. 3, pp. 330–354, 2000.

[15] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1, MIT Press, 1999.

[16] P. Galaviz, B. Brugmann, and Z. Cao, "Numerical evolution of multiple black holes with accurate initial data," Physical Review D - Particles, Fields, Gravitation and Cosmology, vol. 82, no. 2, Article ID 024005, 2010.

[17] B. Aylott, J. G. Baker, W. D. Boggs et al., "Testing gravitational-wave searches with numerical relativity waveforms: results from the first numerical injection analysis (NINJA) project," Classical and Quantum Gravity, vol. 26, no. 16, Article ID 165008, 2009.

[18] K. Clough, P. Figueras, H. Finkel, M. Kunesch, E. A. Lim, and S. Tunyasuvunakool, "GRChombo: numerical relativity with adaptive mesh refinement," Classical and Quantum Gravity, vol. 32, no. 24, Article ID 245011, 2015.

[19] A. Dubey, A. Almgren, J. Bell et al., "A survey of high level frameworks in block-structured adaptive mesh refinement packages," Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3217–3227, 2014.

[20] L. F. Diachin, R. Hornung, P. Plassmann, and A. Wissink, Parallel Adaptive Mesh Refinement, vol. 143 of Parallel Processing for Scientific Computing, 2006.

[21] G. Khanna and J. McKennon, "Numerical modeling of gravitational wave sources accelerated by OpenCL," Computer Physics Communications, vol. 181, no. 9, pp. 1605–1611, 2010.

[22] F. Herrmann, J. Silberholz, M. Bellone, G. Guerberoff, and M. Tiglio, "Integrating post-Newtonian equations on graphics processing units," Classical and Quantum Gravity, vol. 27, no. 3, Article ID 032001, 2010.

[23] B. Brugmann, "A pseudospectral matrix method for time-dependent tensor fields on a spherical shell," Journal of Computational Physics, vol. 235, pp. 216–240, 2013.

[24] B. Zink, "A general relativistic evolution code on CUDA architectures," in Proceedings of the LCI Conference on High-Performance Clustered Computing, Champaign, Ill, USA, 2008.

[25] Y. Liu, Z. Du, S. K. Chung, S. Hooper, D. Blair, and L. Wen, "GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence," Classical and Quantum Gravity, vol. 29, no. 23, Article ID 235018, 2012.

[26] J. Michalakes and M. Vachharajani, "GPU acceleration of numerical weather prediction," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS '08), pp. 1–7, Miami, Fla, USA, April 2008.

[27] R. S. Oliveira, B. M. Rocha, D. Burgarelli, W. Meira Jr., and R. W. dos Santos, "Simulations of cardiac electrophysiology combining GPU and adaptive mesh refinement algorithms," in Bioinformatics and Biomedical Engineering, pp. 322–334, Springer, 2016.

[28] H. Y. Schive, Y. C. Tsai, and T. Chiueh, "GAMER: a graphic processing unit accelerated adaptive-mesh-refinement code for astrophysics," The Astrophysical Journal Supplement Series, vol. 186, no. 2, 2010.

[29] M. Schwarz and M. Stamminger, "Fast GPU-based adaptive tessellation with CUDA," in Computer Graphics Forum, vol. 28, pp. 365–374, Wiley Online Library, 2009.

[30] S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio, C. Tenllado, and F. Catthoor, "Polyhedral parallel code generation for CUDA," ACM Transactions on Architecture and Code Optimization (TACO), vol. 9, no. 4, article 54, 2013.

[31] M. Khan, P. Basu, G. Rudy, M. Hall, C. Chen, and J. Chame, "A script-based autotuning compiler system to generate high-performance CUDA code," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 4, article 31, 2013.

[32] T. Lutz, C. Fensch, and M. Cole, "PARTANS: an autotuning framework for stencil computation on multi-GPU systems," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 4, article 59, 2013.

[33] Y. Yang, P. Xiang, J. Kong, M. Mantor, and H. Zhou, "A unified optimizing compiler framework for different GPGPU architectures," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 2, article 9, 2012.

[34] NVIDIA, CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/.

[35] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-M. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '08), pp. 73–82, ACM, Salt Lake City, Utah, USA, February 2008.

[36] V. W. Lee, C. Kim, J. Chhugani et al., "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU," in Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), pp. 451–460, Saint-Malo, France, June 2010.
