
Research Article
Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran, Myungho Lee, and Sugwon Hong

Department of Computer Science and Engineering, Myongji University, 116 Myongji-ro, Cheoin-gu, Yongin, Gyeonggi-do, Republic of Korea

Correspondence should be addressed to Myungho Lee; myunghol@mju.ac.kr

Received 9 June 2016; Accepted 23 October 2016; Published 16 January 2017

Academic Editor: Basilio B. Fraguela

Copyright © 2017 Nhat-Phuong Tran et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Lattice Boltzmann Method (LBM) is a powerful numerical simulation method of the fluid flow. With its data parallel nature, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming computation phase incurs a lot of uncoalesced accesses on the GPU, which affects the overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, which limits the thread parallelism available at the run time due to the fixed number of registers on the GPU. In this paper, we develop a high performance parallelization of the LBM on a GPU by minimizing the overheads associated with the uncoalesced memory accesses while improving the cache locality using the tiling optimization with the data layout change. Furthermore, we aggressively reduce the register uses for the LBM kernels in order to increase the run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers impressive throughput performance: 1210.63 Million Lattice Updates Per Second (MLUPS).

1. Introduction

Lattice Boltzmann Method (LBM) is a powerful numerical simulation method of the fluid flow, originating from the lattice gas automata methods [1]. LBM models the fluid flow as consisting of particles moving with random motions. Such particles exchange the momentum and the energy through the streaming and the collision processes over the discrete lattice grid in discrete time steps. At each time step, the particles move into adjacent cells, which causes collisions with the existing particles in the cells. The intrinsic data parallel nature of LBM makes this class of applications a promising candidate for parallel implementation on various High Performance Computing (HPC) architectures, including many-core accelerators such as the Graphic Processing Unit (GPU) [2], Intel Xeon Phi [3], and the IBM Cell BE [4].

Recently, the GPU is becoming increasingly popular for the HPC server market and in the Top 500 list, in particular. The architecture of the GPU has gone through a number of innovative design changes in the last decade. It is integrated with a large number of cores and multiple threads per core, levels of cache hierarchies, and a large amount (>5 GB) of on-board memory. The peak floating-point throughput performance (flops) of the latest GPU has drastically increased to surpass 1 Tflops for the double precision arithmetic [5]. In addition to the architectural innovations, user friendly programming environments have been recently developed, such as CUDA [5] from Nvidia, OpenCL [6] from Khronos Group, and OpenACC [7] from a subgroup of the OpenMP Architecture Review Board (ARB). The advanced GPU architecture and the flexible programming environments have made possible innovative performance improvements in many application areas.

In this paper, we develop a high performance parallelization of the LBM on a GPU. The LBM is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming phase of the LBM incurs a lot of uncoalesced accesses on the GPU and affects the overall performance. Previous research focused on utilizing the shared memory of the GPU to deal with the problem [1, 8, 9]. In this paper, we use the tiling algorithm along with the data layout change in order to minimize the overheads of the uncoalesced accesses and improve the cache locality as well. The computation kernels of the LBM involve a large number of floating-point variables, thus using a large number of registers per thread. This limits the available thread parallelism generated at the run time, as the total number of the registers on the GPU is fixed. We developed techniques to aggressively reduce the register uses for the kernels in order to increase the available thread parallelism and the occupancy on the GPU. Furthermore, we developed techniques to remove the branch divergence. Our parallel implementation using CUDA shows impressive performance results. It delivers up to 1210.63 Million Lattice Updates Per Second (MLUPS) throughput performance and a 136-time speedup on the Nvidia Tesla K20 GPU compared with a serial implementation.

The rest of the paper is organized as follows. Section 2 introduces the LBM algorithm. Section 3 describes the architecture of the latest GPU and its programming model. Section 4 explains our techniques for minimizing the uncoalesced accesses and improving the cache locality and the thread parallelism, along with the register usage reduction and the branch divergence removal. Section 5 shows the experimental results on the Nvidia Tesla K20 GPU. Section 6 explains the previous research on parallelizing the LBM. Section 7 wraps up the paper with conclusions.

2. Lattice Boltzmann Method

Lattice Boltzmann Method (LBM) is a powerful numerical simulation of the fluid flow. It is derived as a special case of the lattice gas cellular automata (LGCA) to simulate the fluid motion. The fundamental idea is that the fluids can be regarded as consisting of a large number of small particles moving with random motions. These particles exchange the momentum and the energy through the particle streaming and the particle collision. The physical space of the LBM is discretized into a set of uniformly spaced nodes (lattice). At each node, a discrete set of velocities is defined for the propagation of the fluid molecules. The velocities are referred to as microscopic velocities, which are denoted by $\vec{e}_i$. The LBM model which has $n$ dimensions and $q$ velocity vectors at each lattice point is represented as DnQq. Figure 1 shows a typical lattice node of the most common model in 2D (D2Q9), which has 9 two-dimensional velocity vectors. In this paper, however, we consider the D3Q19 model, which has 19 three-dimensional velocity vectors. Figure 2 shows a typical lattice node of the D3Q19 model, with the 19 velocities $\vec{e}_i$ defined by

$$\vec{e}_i = \begin{cases} (0,0,0), & i = 0,\\ (\pm 1,0,0), (0,\pm 1,0), (0,0,\pm 1), & i = 2,4,6,8,9,14,\\ (\pm 1,\pm 1,0), (0,\pm 1,\pm 1), (\pm 1,0,\pm 1), & i = 1,3,5,7,10,11,12,13,15,16,17,18. \end{cases} \quad (1)$$

(i) Each particle on the lattice is associated with a discrete distribution function, called the particle distribution function (pdf), $f_i(\vec{x}, t)$, $i = 0, \ldots, 18$. The LB equation is discretized as follows:

$$f_i(\vec{x} + c\vec{e}_i \Delta t, t + \Delta t) = f_i(\vec{x}, t) - \frac{1}{\tau}\left[f_i(\vec{x}, t) - f_i^{(eq)}\bigl(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)\bigr)\right] \quad (2)$$

where $c$ is the lattice speed and $\tau$ is the relaxation parameter.

(ii) The macroscopic quantities are the density $\rho$ and the velocity $\vec{u}(\vec{x}, t)$. They are defined as

$$\rho(\vec{x}, t) = \sum_{i=0}^{18} f_i(\vec{x}, t), \quad (3)$$

$$\vec{u}(\vec{x}, t) = \frac{1}{\rho}\sum_{i=0}^{18} c f_i \vec{e}_i. \quad (4)$$

(iii) The equilibrium function $f_i^{(eq)}(\rho(\vec{x}, t), \vec{u}(\vec{x}, t))$ is defined as

$$f_i^{(eq)}\bigl(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)\bigr) = \omega_i \rho + \rho\, s_i\bigl(\vec{u}(\vec{x}, t)\bigr) \quad (5)$$

where

$$s_i(\vec{u}) = \omega_i\left[1 + \frac{3}{c^2}(\vec{e}_i \cdot \vec{u}) + \frac{9}{2c^4}(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2c^2}\,\vec{u}\cdot\vec{u}\right] \quad (6)$$

and the weighting factor $\omega_i$ has the following values:

$$\omega_i = \begin{cases} 1/3, & i = 0,\\ 1/18, & i = 2,4,6,8,9,14,\\ 1/36, & i = 1,3,5,7,10,11,12,13,15,16,17,18. \end{cases} \quad (7)$$
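For concreteness, the velocity set of equation (1) and the weights of equation (7) can be tabulated in code. The following C sketch assumes the index-to-direction mapping suggested by the labels of Figure 2 (C = 0, NE = 1, N = 2, ..., BS = 18); the paper does not spell this mapping out, so the exact ordering is our assumption.

/* One possible D3Q19 velocity table, ordered to match equation (1) and
 * the direction labels of Figure 2 (an assumed mapping). */
static const int e[19][3] = {
    { 0, 0, 0},                                        /* i = 0: rest  */
    { 1, 1, 0}, { 0, 1, 0}, {-1, 1, 0}, {-1, 0, 0},    /* NE N NW W    */
    {-1,-1, 0}, { 0,-1, 0}, { 1,-1, 0}, { 1, 0, 0},    /* SW S SE E    */
    { 0, 0, 1}, { 1, 0, 1}, { 0, 1, 1}, {-1, 0, 1},    /* T TE TN TW   */
    { 0,-1, 1},                                        /* TS           */
    { 0, 0,-1}, { 1, 0,-1}, { 0, 1,-1}, {-1, 0,-1},    /* B BE BN BW   */
    { 0,-1,-1}                                         /* BS           */
};

/* Weights of equation (7): 1/3 for the rest particle, 1/18 for the six
 * axis-aligned directions, 1/36 for the twelve diagonals (sums to 1). */
static const double w[19] = {
    1.0/3,
    1.0/36, 1.0/18, 1.0/36, 1.0/18, 1.0/36, 1.0/18, 1.0/36, 1.0/18,
    1.0/18, 1.0/36, 1.0/36, 1.0/36, 1.0/36,
    1.0/18, 1.0/36, 1.0/36, 1.0/36, 1.0/36
};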

Algorithm 1 summarizes the algorithm of the LBM. The LBM algorithm executes a loop over a number of time steps. At each iteration, two computation steps are applied:

(i) Streaming (or propagation) phase: the particles move according to the pdf into the adjacent cells.


(1) Step 1. Initialize the macroscopic quantities: density $\rho$, velocity $\vec{u}$, the distribution function $f_i$, and the equilibrium function $f_i^{(eq)}$.
(2) Step 2. Streaming phase: move $f_i \to f_i^*$ in the direction of $\vec{e}_i$.
(3) Step 3. Calculate the density $\rho$ and velocity $\vec{u}$ from $f_i^*$ using equations (3) and (4).
(4) Step 4. Calculate the equilibrium function $f_i^{(eq)}$ using equation (5).
(5) Step 5. Collision phase: calculate the updated distribution function $f_i = f_i^* - (1/\tau)(f_i^* - f_i^{(eq)})$ using equation (2).
(6) Repeat Steps 2 to 5 timeSteps-times.

Algorithm 1: Algorithm of LBM.

Figure 1: Lattice cell with 9 discrete directions in the D2Q9 model.

Table 1: Pull and push schemes.

Pull scheme:
+ Read the distribution functions from the adjacent cells: $f_i(\vec{x} - c\vec{e}_i\Delta t, t - \Delta t)$
+ Calculate $\rho$, $\vec{u}$, $f_i^{(eq)}$
+ Update the values to the current cell: $f_i(\vec{x}, t)$

Push scheme:
+ Read the distribution functions from the current cell: $f_i(\vec{x}, t)$
+ Calculate $\rho$, $\vec{u}$, $f_i^{(eq)}$
+ Update the values to the adjacent cells: $f_i(\vec{x} + c\vec{e}_i\Delta t, t + \Delta t)$

(ii) Collision phase: the particles collide with other particles streaming into this cell from different directions.

Depending on whether the streaming phase precedes or follows the collision phase, we have the pull or the push scheme in the update process [10]. The pull scheme (Figure 3) pulls the post-collision values from the previous time step from lattice A and then performs the collision on these to produce the new pdfs, which are stored in lattice B. In the push scheme (Figure 4), on the other hand, the pdfs of one node (square with black arrows) are read from lattice A; then the collision step is performed first. The post-collision values are propagated to the neighbor nodes in the streaming step to lattice B (red arrows). Table 1 compares the computation steps of these schemes.

In Algorithm 2 we list the skeleton of the LBM algorithm, which consists of the collision phase and the streaming phase. In the function LBM, the collide function and the stream function are called timeSteps-times. At the end of each time step,

Figure 2: Lattice cell with 19 discrete directions in the D3Q19 model.

Figure 3: Illustration of the pull scheme with the D2Q9 model [11]: (a) pull, lattice A; (b) pull, lattice B.

Figure 4: Illustration of the push scheme with the D2Q9 model [11]: (a) push, lattice A; (b) push, lattice B.


(1) void LBM(double *source_grid, double *dest_grid, int grid_size, int timeSteps)
(2) {
(3)   int i;
(4)   double *temp_grid;
(5)   for (i = 0; i < timeSteps; i++)
(6)   {
(7)     collide(source_grid, temp_grid, grid_size);
(8)     stream(temp_grid, dest_grid, grid_size);
(9)     swap_grid(source_grid, dest_grid);
(10)  }
(11) }

Algorithm 2: Basic skeleton of LBM algorithm.

source_grid and dest_grid are swapped to interchange the values between the two grids.

3. Latest GPU Architecture

Recently, the many-core accelerator chips are becoming increasingly popular for the HPC applications. The GPU chips from Nvidia and AMD are representative ones, along with the Intel Xeon Phi. The latest GPU architecture is characterized by a large number of uniform fine-grain programmable cores or thread processors, which have replaced the separate processing units for shader, vertex, and pixel in the earlier GPUs. Also, the clock rate of the latest GPU has ramped up significantly. These have drastically improved the floating-point performance of the GPUs, far exceeding that of the latest CPUs. The fine-grain cores (or thread processors) are distributed in multiple streaming multiprocessors (SMX) (or thread blocks) (see Figure 5). Software threads are divided into a number of thread groups (called WARPs), each of which consists of 32 threads. Threads in the same WARP are scheduled and executed together on the thread processors in the same SMX in the SIMD (Single Instruction Multiple Data) mode. Each thread executes the same instruction, directed by the common Instruction Unit, on its own data streaming from the device memory to the on-chip cache memories and registers. When a running WARP encounters a cache miss, for example, the context is switched to a new WARP while the cache miss is serviced for the next few hundred cycles; thus the GPU executes in a multithreaded fashion as well.

The GPU is built around a sophisticated memory hierarchy, as shown in Figure 5. There are registers and local memories belonging to each thread processor or core. The local memory is an area in the off-chip device memory. Shared memory, level-1 (L-1) cache, and read-only data cache are integrated in a thread block of the GPU. The shared memory is a fast (as fast as registers) programmer-managed memory. Level-2 (L-2) cache is integrated on-chip and used among all the thread blocks. Global memory is an area in the off-chip device memory accessed from all the thread blocks, through which the GPU can communicate with the host CPU. Data in the global memory get cached directly in the shared memory by the programmer, or they can be cached through

Figure 5: Architecture of a latest GPU (Nvidia Tesla K20).

the L-2 and L-1 caches automatically as they get accessed. There are constant memory and texture memory regions in the device memory also. Data in these regions is read-only. They can be cached in the L-2 cache and the read-only data cache. On the Nvidia Tesla K20, read-only data from the global memory can be loaded through the same cache used by the texture pipeline via a standard pointer, without the need to bind to a texture beforehand. This read-only cache is used automatically by the compiler as long as certain conditions are met; the __restrict__ qualifier should be used when a variable is declared to help the compiler detect the conditions [5].

In order to efficiently utilize the latest advanced GPU architectures, programming environments such as CUDA [5] from Nvidia, OpenCL [6] from Khronos Group, and OpenACC [7] from a subgroup of the OpenMP Architecture Review Board (ARB) have been developed. Using these environments, users can have a more direct control over the


(1) int i;
(2) for (i = 0; i < timeSteps; i++)
(3) {
(4)   collision_kernel<<<GRID, BLOCK>>>(source_grid, temp_grid, xdim, ydim, zdim, cell_size, grid_size);
(5)   cudaThreadSynchronize();
(6)   streaming_kernel<<<GRID, BLOCK>>>(temp_grid, dest_grid, xdim, ydim, zdim, cell_size, grid_size);
(7)   cudaThreadSynchronize();
(8)   swap_grid(source_grid, dest_grid);
(9) }

Algorithm 3: Two separate CUDA kernels for the different phases of LBM.

large number of GPU cores and its sophisticated memory hierarchy. The flexible architecture and the programming environments have led to a number of innovative performance improvements in many application areas, and many more are still to come.

4. Optimizing Cache Locality and Thread Parallelism

In this section, we first introduce some preliminary steps we employed in our parallelization and optimization of the LBM algorithm in Section 4.1. They are mostly borrowed from the previous research, such as combining the collision phase and the streaming phase, a GPU architecture friendly data organization scheme (the SoA scheme), an efficient data placement in the GPU memory hierarchy, and using the pull scheme for avoiding and minimizing the uncoalesced memory accesses. Then we describe our key optimization techniques for improving the cache locality and the thread parallelism, such as the tiling with the data layout change and the aggressive reduction of the register uses per thread, in Sections 4.2 and 4.3. Optimization techniques for removing the branch divergence are presented in Section 4.4. Our key optimization techniques presented in this section have been improved from our earlier work in [13].

4.1. Preliminaries

4.1.1. Combination of Collision Phase and Streaming Phase. As shown in the description of the LBM algorithm in Algorithm 2, the LBM consists of two main computing kernels: collision_kernel for the collision phase and streaming_kernel for the streaming phase. In collision_kernel, threads load the particle distribution function from the source grid (source_grid) and then calculate the velocity, the density, and the collision product. The post-collision values are stored to the temporary grid (temp_grid). In streaming_kernel, the post-collision values from temp_grid are loaded and updated to the appropriate neighbor grid cells in the destination grid (dest_grid). At the end of each time step, source_grid and dest_grid are swapped for the next time step. This implementation (see

(1) int i;
(2) for (i = 0; i < timeSteps; i++)
(3) {
(4)   lbm_kernel<<<GRID, BLOCK>>>(source_grid, dest_grid, xdim, ydim, zdim, cell_size, grid_size);
(5)   cudaThreadSynchronize();
(6)   swap_grid(source_grid, dest_grid);
(7) }

Algorithm 4: Single CUDA kernel after combining the two phases of LBM.

Algorithm 3) needs extra loads/stores from/to temp_grid, which is stored in the global memory and affects the global memory bandwidth [2]. In addition, some extra cost is incurred with the global synchronization between the two kernels (cudaThreadSynchronize), which affects the overall performance. In order to reduce these overheads, we can combine collision_kernel and streaming_kernel into one kernel, lbm_kernel, where the collision product is streamed to the neighbor grid cells directly after calculation (see Algorithm 4). Compared with Algorithm 3, storing to and loading from temp_grid are removed, and the global synchronization cost is reduced.

4.1.2. Data Organization. In order to represent the 3-dimensional grid of cells, we use a 1-dimensional array which has $N_x \times N_y \times N_z \times Q$ elements, where $N_x$, $N_y$, $N_z$ are the width, height, and depth of the grid and $Q$ is the number of directions of each cell [2, 14]. For example, if the model is D3Q19 with $N_x = 16$, $N_y = 16$, and $N_z = 16$, we have a 1D array of $16 \times 16 \times 16 \times 19 = 77824$ elements. We use 2 separate arrays for storing the source grid and the destination grid.

There are two common data organization schemes for storing the arrays:

(i) Array of structures (AoS): grid cells are arranged in a 1D array; the 19 distributions of each cell occupy 19 consecutive elements of the 1D array (Figure 6).


Figure 6: AoS scheme.

Figure 7: SoA scheme.

(ii) Structure of arrays (SoA): the values of one distribution of all cells are arranged consecutively in memory (Figure 7). This scheme is more suitable for the GPU architecture, as we will show in the experimental results (Section 5); the index sketch below makes the difference concrete.
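The following index helpers (hypothetical macros of ours, not the paper's code, assuming Nx, Ny, Nz are in scope) contrast the two layouts for the linear cell index cell = x + y*Nx + z*Nx*Ny and direction q = 0..18:

/* AoS: the 19 pdfs of a cell are contiguous, so consecutive threads
 * (consecutive cells) touch addresses 19 floats apart -> uncoalesced. */
#define IDX_AOS(cell, q)   ((cell) * 19 + (q))

/* SoA: all cells' values of one direction are contiguous, so consecutive
 * threads touch consecutive addresses -> coalesced on the GPU. */
#define IDX_SOA(cell, q)   ((q) * (Nx * Ny * Nz) + (cell))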

4.1.3. Data Placement. In order to efficiently utilize the memory hierarchy of the GPU, the placement of the major data structures of the LBM is crucial [5]. In our implementation, we use the following arrays: src_grid, dst_grid, types_arr, lc_arr, and nb_arr. We use the following data placements for these arrays:

(i) src_grid and dst_grid are used to store the input grid and the result grid. They are swapped at the end of each time step by exchanging their pointers, instead of explicit storing to and loading from the memory through a temporary array. Since src_grid and dst_grid are very large arrays with a lot of data stores and loads, we place them in the global memory.

(ii) In the types_arr array, the types of the grid cells are stored. We use the Lid Driven Cavity (LDC) as the test case in this paper. The LDC consists of a cube filled with the fluid. One side of the cube serves as the acceleration plane by sliding constantly. The acceleration is implemented by assigning the cells in the acceleration area a constant velocity. This requires three types of cells: regular fluid, acceleration cells, or boundary. Thus we also need the 1D array types_arr in order to store the type of each cell in the grid. The size of this array is $N_x \times N_y \times N_z$ elements. For example, if the model is D3Q19 with $N_x = 16$, $N_y = 16$, and $N_z = 16$, the size of the array is $16 \times 16 \times 16 = 4096$ elements. Thus types_arr is a large array also and contains constant values; they are not modified throughout the execution of the program. For these reasons, the texture memory is the right place for this array.

(iii) lc_arr and nb_arr are used to store the base indices for the accesses to the 19 directions of the current cell and the neighbor cells, respectively. There are 19 indices, corresponding to the 19 directions of the D3Q19 model. These indices are calculated at the start of the program and used till the end of the program execution. Thus we use the constant memory to store them. Standing at any cell, we use the following formulas to define the position in the 1D array of any direction out of the 19 cell directions: curr_dir_pos_in_arr = cell_pos + lc_arr[direction] (for the current cell) and nb_dir_pos_in_arr = nb_pos + nb_arr[direction] (for the neighbor cells).
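A minimal CUDA sketch of this placement is shown below. Only the array names (src_grid, dst_grid, types_arr, lc_arr, nb_arr) come from the paper; the declarations and the kernel skeleton around them are our assumptions.

__constant__ int lc_arr[19];   /* offsets of the 19 pdfs of the current cell   */
__constant__ int nb_arr[19];   /* offsets of the 19 pdfs of the neighbor cells */

__global__ void lbm_kernel(const float *src_grid,                /* global mem */
                           float *dst_grid,                      /* global mem */
                           const unsigned char *__restrict__ types_arr)
{                                                    /* read-only data cache   */
    int cell_pos = blockIdx.x * blockDim.x + threadIdx.x;  /* linear cell id   */
    unsigned char cell_type = types_arr[cell_pos];
    /* pdf of direction dir of the current cell:
       curr_dir_pos_in_arr = cell_pos + lc_arr[dir]  */
    float f0 = src_grid[cell_pos + lc_arr[0]];
    dst_grid[cell_pos + nb_arr[0]] = f0;     /* store toward a neighbor        */
    (void)cell_type;                         /* rest of the kernel elided      */
}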

4.1.4. Using the Pull Scheme to Reduce the Costs of Uncoalesced Accesses. Coalescing the global memory accesses can significantly reduce the memory overheads on the GPU. Multiple global memory loads whose addresses fall within the same 128-byte range are combined into one request and sent to the memory. This saves the memory bandwidth a lot and improves the performance. In order to reduce the costs of the uncoalesced accesses, we use the pull scheme [12]. Choosing the pull scheme comes from the observation that the cost of the uncoalesced reading is smaller than the cost of the uncoalesced writing.

Algorithm 5 shows the LBM algorithm using the push scheme. At the first step, the pdfs are copied directly from the current cell. These pdfs are used to calculate the pdfs at the new time step (collision phase). The new pdfs are then streamed to the adjacent cells (streaming phase). At


(1) __global__ void soa_push_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   // Gather 19 pdfs from the current cell
(4)
(5)   // Apply boundary conditions
(6)
(7)   // Calculate the mass density ρ and the velocity u
(8)
(9)   // Calculate the local equilibrium distribution functions f^(eq) using ρ and u
(10)
(11)  // Calculate the pdfs at the new time step
(12)
(13)  // Stream 19 pdfs to the adjacent cells
(14) }

Algorithm 5: Kernel using the push scheme.

(1) __global__ void soa_pull_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   // Stream 19 pdfs from the adjacent cells to the current cell
(4)
(5)   // Apply boundary conditions
(6)
(7)   // Calculate the mass density ρ and the velocity u
(8)
(9)   // Calculate the local equilibrium distribution functions f^(eq) using ρ and u
(10)
(11)
(12)  // Calculate the pdfs at the new time step
(13)
(14)  // Save 19 values of the pdf to the current cell
(15) }

Algorithm 6: Kernel using the pull scheme.

the streaming phase, the distribution values are updated to the neighbors after they are calculated. All distribution values which do not move to the east or west direction ($x$-direction values equal to 0) can be updated to the neighbors (written to the device memory) directly without any misalignment. However, the other distribution values ($x$-direction values equal to +1 or −1) need to be considered carefully because of their misaligned update positions. The update positions are shifted to memory locations that do not belong to the same 128-byte segment, while the thread indexes are not shifted correspondingly. So the misaligned accesses occur, and the performance can degrade significantly.

If we use the pull scheme, on the other hand, the order of the collision phase and the streaming phase in the LBM kernel is reversed (see Algorithm 6). At the first step of the pull scheme, the pdfs from the adjacent cells are gathered to the current cell (streaming phase) (Lines 3–5). Next, these pdfs are used to calculate the pdfs at the new time step, and these new pdfs are then stored to the current cell directly (collision phase). Thus, in the pull scheme, the uncoalesced accesses occur when the data is read from the device memory, whereas they occur when the data is written in the push scheme. As a result, the cost of the uncoalesced accesses is smaller with the pull scheme.

4.2. Tiling Optimization with Data Layout Change. In the D3Q19 model of the LBM, as the computations for the streaming and the collision phases are conducted for a certain cell, 19 distribution values which belong to the 19 surrounding cells are accessed. Figure 8 shows the data accesses to the 19 cells when


Figure 8: Data accesses for the orange cell in conducting the computations for the streaming and collision phases.

a thread performs the computations for the orange colored cell. The 19 cells (18 directions (green cells) + the current cell in the center (orange cell)) are distributed on three different planes. Let $P_i$ be the plane containing the current computing (orange) cell, and let $P_{i-1}$ and $P_{i+1}$ be the lower and upper planes, respectively. The $P_i$ plane contains 9 cells; the $P_{i-1}$ and $P_{i+1}$ planes contain 5 cells each. When the computations for the cell, for example, $(x, y, z) = (1, 1, 1)$, are performed, the following cells are accessed:

(i) $P_0$ plane: (0,1,0), (1,0,0), (1,1,0), (1,2,0), (2,1,0)
(ii) $P_1$ plane: (0,0,1), (0,1,1), (0,2,1), (1,0,1), (1,1,1), (1,2,1), (2,0,1), (2,1,1), (2,2,1)
(iii) $P_2$ plane: (0,1,2), (1,0,2), (1,1,2), (1,2,2), (2,1,2)

The 9 accesses for the $P_1$ plane are divided into three groups: {(0,0,1), (0,1,1), (0,2,1)}, {(1,0,1), (1,1,1), (1,2,1)}, {(2,0,1), (2,1,1), (2,2,1)}. Each group accesses consecutive memory locations belonging to the same row. Accesses of the different groups are separated apart and lead to uncoalesced accesses on the GPU when $N_x$ is sufficiently large. In each of the $P_0$ and $P_2$ planes, there are three groups of accesses each. Here, the accesses of the same group touch consecutive memory locations, and accesses of the different groups are separated apart in the memory, which leads to uncoalesced accesses also. Accesses to the data elements in the different planes ($P_0$, $P_1$, and $P_2$) are further separated apart and also lead to uncoalesced accesses when $N_y$ is sufficiently large.

As the computations proceed, three rows in the $y$-dimension of the $P_0$, $P_1$, $P_2$ planes will be accessed sequentially, for $x = 0 \sim N_x - 1$, $y = 0, 1, 2$, followed by $x = 0 \sim N_x - 1$, $y = 1, 2, 3$; ...; $x = 0 \sim N_x - 1$, $y = N_y - 3, N_y - 2, N_y - 1$. When the complete $P_0$, $P_1$, $P_2$ planes are swept, similar data accesses will continue for the $P_1$, $P_2$, and $P_3$ planes, and so on. Therefore, there are a lot of data reuses in the $x$-, $y$-, and $z$-dimensions. As explained in Section 4.1.2, the 3D lattice grid is stored in the 1D array. The 19 cells for the computations belonging to the same plane are stored $\pm 1$ or $\pm N_x \pm 1$ cells away. The cells in different planes are stored $\pm N_x \times N_y \pm N_x \pm 1$ cells away. The data reuse distance along the $x$-dimension is short: +1 or +2 loop iterations apart. The data reuse distance along the $y$- and $z$-dimensions is $\pm N_x \pm 1$ or $\pm N_x \times N_y \pm N_x \pm 1$ iterations apart. If we can make the data reuse occur faster by reducing the reuse distances, for example, using the tiling optimization, it can greatly improve the cache hit ratio. Furthermore, it can reduce the overheads of the uncoalesced accesses, because lots of global memory accesses can be removed by the cache hits. Therefore, we tile the 3D lattice grid into smaller 3D blocks. We also change the data layout in accordance with the data access patterns of the tiled code, in order to store the data elements of the different groups closer in the memory. Thus we can remove a lot of uncoalesced memory accesses, because they can be stored within the 128-byte boundary. In Sections 4.2.1 and 4.2.2, we describe our tiling and data layout change optimizations.

4.2.1. Tiling. Let us assume the following:

(i) $N_x$, $N_y$, and $N_z$ are the sizes of the grid in the $x$-, $y$-, and $z$-dimension.
(ii) $n_x$, $n_y$, and $n_z$ are the sizes of the 3D block in the $x$-, $y$-, and $z$-dimension.
(iii) An $xy$-plane is a subplane which is composed of ($n_x \times n_y$) cells.

We tile the grid into small 3D blocks with the tile sizes of $n_x$, $n_y$, and $n_z$ (yellow block in Figure 9(a)), where

$$n_x = N_x \div x_c, \quad n_y = N_y \div y_c, \quad n_z = N_z \div z_c, \quad x_c, y_c, z_c \in \{1, 2, 3, \ldots\}. \quad (8)$$

We let each CUDA thread block process one 3D tiled block. Thus $n_z$ $xy$-planes need to be loaded for each thread block. In each $xy$-plane, each thread of the thread block executes the computations for one grid cell. Thus each thread deals with a column containing $n_z$ cells (the red column in Figure 9(b)). If $z_c = 1$, each thread processes $N_z$ cells, and if $z_c = N_z$, each thread processes only one cell. The tile size can be adjusted by changing the constants $x_c$, $y_c$, and $z_c$. These constants need to be selected carefully to optimize the performance. Using the tiling, the number of created threads is reduced by $z_c$-times.
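A sketch of this mapping in CUDA follows; it is our assumed implementation, not the paper's exact code, using the tile sizes that worked best in Section 5 ($n_x = 32$, $n_y = 16$, $z_c = 4$).

__global__ void lbm_tiled_kernel(const float *src, float *dst,
                                 int Nx, int Ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    for (int k = 0; k < nz; k++) {                 /* walk one column     */
        int z = blockIdx.z * nz + k;
        long cell = x + (long)y * Nx + (long)z * Nx * Ny;
        dst[cell] = src[cell];    /* stream-in, collide, store (elided) */
    }
}

/* Host side: one thread block per 3D tile, one thread per (x,y) column. */
dim3 block(32, 16, 1);
dim3 grid(Nx / 32, Ny / 16, 4);                    /* z_c = 4 tiles in z  */
lbm_tiled_kernel<<<grid, block>>>(src_grid, dst_grid, Nx, Ny, Nz / 4);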

4.2.2. Data Layout Change. In order to further improve the benefits of the tiling and reduce the overheads associated with the uncoalesced accesses, we propose to change the data layout. Figure 10 shows one $xy$-plane of the grid with and without the layout change. With the original layout (Figure 10(a)), the data is stored in the row major fashion. Thus the entire first row is stored, followed by the second


Figure 9: Tiling optimization for LBM: (a) grid divided into 3D blocks; (b) block containing subplanes, where the red column contains all cells one thread processes.

Figure 10: Different data layouts for blocks: (a) original data layout; (b) proposed data layout.

row, and so on. In the proposed new layout, the cells in the tiled first row of Block 0 are stored first. Then the second tiled row of Block 0 is stored, instead of the first row of Block 1 (Figure 10(b)). With the layout change, the data cells accessed in the consecutive iterations of the tiled code are placed sequentially. This places the data elements of the different groups closer. Thus it increases the possibility for the memory accesses to the different groups to be coalesced, if the tiling factor and the memory layout factor are adjusted appropriately. This can further improve the performance beyond the tiling.

The data layout can be transformed using the following formula:

$$\text{index}_{\text{new}} = x_{\text{id}} + y_{\text{id}} \times N_x + z_{\text{id}} \times N_x \times N_y \quad (9)$$

where $x_{\text{id}}$ and $y_{\text{id}}$ are the cell indexes in the $x$- and $y$-dimension on the plane of the grid, and $z_{\text{id}}$ is a value in the range of 0 to $n_z - 1$. $x_{\text{id}}$ and $y_{\text{id}}$ can be calculated as follows:

$$x_{\text{id}} = (\text{block index in } x\text{-dimension}) \times (\text{number of threads in thread block in } x\text{-dimension}) + (\text{thread index in thread block in } x\text{-dimension}),$$
$$y_{\text{id}} = (\text{block index in } y\text{-dimension}) \times (\text{number of threads in thread block in } y\text{-dimension}) + (\text{thread index in thread block in } y\text{-dimension}). \quad (10)$$

In our implementation, we use the changed input data layout stored offline before the program starts. (The original input is changed to the new layout and stored to the input file.) Then the input file is used while conducting the experiments.
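A host-side relayout pass consistent with this description might look as follows; this is our sketch of one possible tile-major reordering, not the paper's code.

/* Reorder a scalar field tile-by-tile so that the rows of one 3D tile
 * are adjacent in memory (an assumed implementation of Figure 10(b)). */
void relayout(const float *in, float *out,
              int Nx, int Ny, int Nz, int nx, int ny, int nz)
{
    long idx_new = 0;
    for (int bz = 0; bz < Nz / nz; bz++)
    for (int by = 0; by < Ny / ny; by++)
    for (int bx = 0; bx < Nx / nx; bx++)              /* one 3D tile  */
        for (int z = bz * nz; z < (bz + 1) * nz; z++)
        for (int y = by * ny; y < (by + 1) * ny; y++)
        for (int x = bx * nx; x < (bx + 1) * nx; x++)
            out[idx_new++] = in[x + (long)y * Nx + (long)z * Nx * Ny];
}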

4.3. Reduction of Register Uses per Thread. The D3Q19 model is more precise than the models with smaller distributions, such as D2Q9 or D3Q13, thus using more variables. This leads to more register uses in the main computation kernels. On the GPU, the register use of the threads is one of the factors limiting the number of active WARPs on a streaming multiprocessor (SMX). Higher register uses can lead to lower parallelism and occupancy (see Figure 11 for an example), which results in an overall performance degradation. The Nvidia compiler provides ways to limit the register uses to a certain limit, such as the -maxrregcount flag or the __launch_bounds__() qualifier [5]. The -maxrregcount switch sets a maximum on the number of registers used by each thread. These can help increase the occupancy by reducing the register uses per thread. However, our experiments show that the overall performance goes down, because they lead to a lot of register spills/refills to/from the local memory. The increased


Figure 11: Sharing 2048 registers among (a) a larger number of threads with smaller register uses versus (b) a smaller number of threads with larger register uses: (a) eight blocks, 64 threads per block, and 4 registers per thread; (b) eight blocks, 32 threads per block, and 8 registers per thread.

memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.

In order to reduce the register uses per thread while avoiding the register spill/refill to/from the local memory, we used the following techniques:

(i) Calculate the indexing addresses of the distributions manually. Each cell has 19 distributions; thus we would need 38 variables for storing indexes (19 distributions × 2 memory accesses, for load and store) in the D3Q19 model. However, each index variable is used only one time in each execution phase. Thus we can use only two variables instead of 38: one for calculating the loading indexes and one for calculating the storing indexes.

(ii) Use the shared memory for the commonly used variables among threads, for example, to store the base addresses.

(iii) Cast multiple small size variables into one large variable; for example, we combined 4 char-type variables into one integer variable.

(iv) For simple operations which can be easily calculated, we do not store the results to memory variables. Instead, we recompute them later.

(v) We use only one array to store the distributions, instead of using 19 arrays separately.

(vi) In the original LBM code, a lot of variables are declared to store the FP computation results, which increases the register uses. In order to reduce the register uses, we attempt to reuse variables whose life-time ended earlier in the former code. This may lower the instruction-level parallelism of the kernel. However, it helps increase the thread-level parallelism, as more threads can be active at the same time with the reduced register uses per thread.

(vii) In the original LBM code, there are some complicated floating-point (FP) intensive computations used in a number of nearby statements. We aggressively extract these computations as common subexpressions. This frees the registers involved in the common subexpressions, thus reducing the register uses. It also reduces the number of dynamic instruction counts.

Applying the above techniques in our implementation, the number of registers in each kernel is greatly reduced from 70 registers to 40 registers. This leads to a higher occupancy of the SMXs and significant performance improvements.
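The fragment below illustrates techniques (i) and (iii) in CUDA; the names are illustrative (our sketch), not the paper's kernel.

__global__ void sketch_kernel(const float *src_grid, const int *lc_arr,
                              const unsigned char *flags, float *out)
{
    int cell_pos = blockIdx.x * blockDim.x + threadIdx.x;

    /* (i) One reusable index variable instead of 19 live index
     * registers: the loading index is recomputed per direction. */
    float rho = 0.0f;
    int idx;
    for (int q = 0; q < 19; q++) {
        idx = cell_pos + lc_arr[q];    /* recomputed, then dead again  */
        rho += src_grid[idx];          /* e.g., accumulate the density */
    }

    /* (iii) Four char-type values packed into one 32-bit variable. */
    unsigned int packed = flags[0] | (flags[1] << 8)
                        | (flags[2] << 16) | (flags[3] << 24);
    out[cell_pos] = rho + (float)((packed >> 8) & 0xFF);
}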

4.4. Removing Branch Divergence. Flow control instructions on the GPU cause the threads of the same WARP to diverge; the resulting different execution paths get serialized. This can significantly affect the performance of the application program on the GPU. Thus the branch divergence should be avoided as much as possible. In the LBM code, there are two main problems which can cause branch divergence:

(i) Solving the streaming at the boundary positions

(ii) Defining actions for corresponding cell types


Figure 12: Illustration of the lattice with a "ghost layer".

(1) IF (cell_type == FLUID)
(2)   x = a;
(3) ELSE
(4)   x = b;

Algorithm 7: Skeleton of the IF-statement used in the LBM kernel.

(1) is_fluid = (cell_type == FLUID);
(2) x = a * is_fluid + b * (!is_fluid);

Algorithm 8: Code with the IF-statement removed.

In order to avoid using IF-statements while streaming at the boundary position, a "ghost layer" is attached in the $y$- and $z$-dimension. If $N_x$, $N_y$, and $N_z$ are the width, height, and depth of the original grid, then $NN_x = N_x$, $NN_y = N_y + 1$, and $NN_z = N_z + 1$ are the new width, height, and depth of the grid with the ghost layer (Figure 12). With the ghost layer, we can regard the computations at the boundary position as the normal ones, without worrying about running out of the index bound.

As explained in Section 4.1.2, the cells of the grid belong to three types: the regular fluid, the acceleration cells, or the boundary. The LBM kernel contains conditions to define the actions for each type of cell. The boundary cell type can be covered in the above-mentioned way using the ghost layer. This leaves the other two different conditions in the same half-WARP of the GPU. Thus, in order to remove the IF-conditions, we combine the conditions into computational statements. Using this technique, the IF-statement in Algorithm 7 is rewritten as in Algorithm 8.
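Applied inside the kernel to the two cell types that remain after the ghost layer handles the boundary, the same idea might look as follows (ACCEL_CELL and the LID_VELOCITY_* constants are hypothetical names of ours):

/* Acceleration cells take the constant lid velocity, fluid cells keep
 * the computed one, with no divergent IF/ELSE in the half-WARP. */
int is_accel = (cell_type == ACCEL_CELL);              /* 0 or 1 */
ux = is_accel * LID_VELOCITY_X + (1 - is_accel) * ux;
uy = is_accel * LID_VELOCITY_Y + (1 - is_accel) * uy;
uz = (1 - is_accel) * uz;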

5. Experimental Results

In this section, we first describe the experimental setup. Then we show the performance results with analyses.

Figure 13: Performance (MLUPS) comparison of the serial and the AoS with different domain sizes.

5.1. Experimental Setup. We implemented the LBM in the following five ways:

(i) Serial implementation using a single CPU core (serial), using the source code from the SPEC CPU 2006 470.lbm [15] to make sure it is reasonably optimized.

(ii) Parallel implementation on a GPU using the AoS data scheme (AoS).

(iii) Parallel implementation using the SoA data scheme and the push scheme (SoA_Push_Only).

(iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA_Pull_Only).

(v) SoA using the pull scheme with our various optimizations, including the tiling with the data layout change (SoA_Pull_*).

We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. Domain grid sizes are scaled in the range of 64³, 128³, 192³, and 256³. The numbers of time steps are 1000, 5000, and 10000.

In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used, which is calculated as follows:

$$\text{MLUPS} = \frac{N_x \times N_y \times N_z \times N_{\text{ts}}}{10^6 \times T} \quad (11)$$

where $N_x$, $N_y$, and $N_z$ are the domain sizes in the $x$-, $y$-, and $z$-dimension, $N_{\text{ts}}$ is the number of time steps used, and $T$ is the run time of the simulation.
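Equation (11) transcribes directly into C; the small helper below is ours, not from the paper.

/* Direct transcription of equation (11). */
double mlups(long Nx, long Ny, long Nz, long n_ts, double seconds)
{
    return (double)Nx * Ny * Nz * n_ts / (1e6 * seconds);
}
/* Example: a 256^3 grid advanced for 10000 time steps in about 138.6 s
 * gives roughly 1210 MLUPS, the peak figure reported in Section 5.3. */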

Our experiments were conducted on a system incorporating an Intel multicore processor (6-core 2.0 GHz Intel Xeon E5-2650) with a 20 MB level-3 cache and an Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB of device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, the Nvidia GTX285.

5.2. Results Using Previous Approaches. The average performances of the serial and the AoS are shown in Figure 13.


Table 2: Summary of experiments.

Experiment | Description
Serial | Serial implementation on a single CPU core
Parallel:
AoS | AoS scheme
SoA_Push_Only | SoA scheme + push data scheme
SoA_Pull:
SoA_Pull_Only | SoA scheme + pull data scheme
SoA_Pull_BR | SoA scheme + pull data scheme + branch divergence removal
SoA_Pull_RR | SoA scheme + pull data scheme + register usage reduction
SoA_Pull_Full | SoA scheme + pull data scheme + branch divergence removal + register usage reduction
SoA_Pull_Full_Tiling | SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA with different domain sizes.

With the various domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the serial are 9.82 MLUPS, 7.42 MLUPS, 9.57 MLUPS, and 8.93 MLUPS. The MLUPS for the AoS are 112.06 MLUPS, 78.69 MLUPS, 76.86 MLUPS, and 74.99 MLUPS, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme. The SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³. Note that in this experiment we applied only the SoA scheme, without any other optimization techniques.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either of the implementations. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for domain sizes 64³, 128³, 192³, and 256³, respectively. Thus the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively. The number of global memory transactions

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

observed shows that the total transactions (loads and stores) of the pull and push schemes are quite equivalent. However, the number of store transactions of the pull scheme is 5.62% smaller than that of the push scheme. This leads to the performance improvement of the pull scheme compared with the push scheme.

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull implementation.

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences in the kernel code, explained in Section 4.4. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses, described in Section 4.3, improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing register uses for different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with our optimization techniques, such as the branch divergence removal and the register usage reduction described in Sections 4.3 and 4.4 (SoA_Pull_Full), against the SoA_Pull_Only. The optimized SoA_Pull implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performance is better than the SoA_Pull_Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tile size for the best performance is $n_x = 32$, $n_y = 16$, and $n_z = N_z \div 4$.

Figure 18: Performance (MLUPS) comparison of the SoA with and without our optimization techniques for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of the four implementations (serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA_Pull_Full_Tiling with the domain size 256³, where the speedup of 136 is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were conducted on the same K20 GPU. Our approach performs better than [12] by 14% to 19%. Our approach incorporates


Table 3: Performance (MLUPS) comparisons of four implementations.

Domain size | Time steps | Serial | AoS | SoA_Pull_Only | SoA_Pull_Full_Tiling
64³ | 1000 | 9.89 | 111.73 | 759.52 | 1034
64³ | 5000 | 9.75 | 112.24 | 814.01 | 1115.63
64³ | 10000 | 9.82 | 111.73 | 818.36 | 1129.32
64³ | Avg. Perf. | 9.82 | 112.41 | 797.30 | 1092.99
128³ | 1000 | 7.64 | 78.58 | 798.21 | 1115.2
128³ | 5000 | 7.65 | 78.74 | 855.55 | 1189.15
128³ | 10000 | 6.98 | 78.74 | 861.42 | 1199.33
128³ | Avg. Perf. | 7.42 | 78.69 | 838.39 | 1167.69
192³ | 1000 | 9.47 | 76.79 | 811.35 | 1114.96
192³ | 5000 | 9.69 | 76.89 | 866.76 | 1185.95
192³ | 10000 | 9.56 | 76.91 | 871.39 | 1205.48
192³ | Avg. Perf. | 9.57 | 76.86 | 849.83 | 1168.8
256³ | 1000 | 8.99 | 74.74 | 787.76 | 1113.77
256³ | 5000 | 8.91 | 75.09 | 873.8 | 1182.33
256³ | 10000 | 8.9 | 75.14 | 883.56 | 1210.63
256³ | Avg. Perf. | 8.93 | 74.99 | 848.37 | 1168.91

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with previous work [12].

Domain size | Mawson and Revell (2014) | Our work
64³ | 914 | 1129
128³ | 990 | 1199
192³ | 1036 | 1205
256³ | 1020 | 1210

more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].

Figure 21: Performance (MLUPS) on the GTX285 with different domain sizes.

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with the domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimization technique SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. Also, we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX285 compared with the K20.


Table 5: Performance (MLUPS) on GTX285.

Domain size | Serial | AoS | SoA_Push_Only | SoA_Pull_Only | SoA_Pull_BR | SoA_Pull_RR | SoA_Pull_Full | SoA_Pull_Full_Tiling
64³ | 9.82 | 45.22 | 240.97 | 257.36 | 268.45 | 283.24 | 296.73 | 328.87
128³ | 7.42 | 49.85 | 242.12 | 259.04 | 270.58 | 285.68 | 299.79 | 335.77
160³ | 9.57 | 50.15 | 237.58 | 254.18 | 266.25 | 279.59 | 294.26 | 328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalition technique is incorporated. Thus most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory. Tölke in [9] used this approach and implemented the D2Q9 model. Habich et al. [8] followed the same approach for the D3Q19 model. Bailey et al. in [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme was used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers. This lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.
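The following sketch illustrates the pull idea for a single east-moving pdf; the names and the periodic wrap are our illustrative assumptions, and the real D3Q19 kernel gathers all 19 pdfs before colliding in place:

__global__ void pull_east(const float * __restrict__ f_src, float *f_dst,
                          int Nx, int Ny, int Nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= Nx || y >= Ny || z >= Nz) return;

    int cell = (z * Ny + y) * Nx + x;
    int west = cell - (x > 0 ? 1 : 1 - Nx);  /* west neighbor, periodic in x */

    /* Pull: the shifted (possibly uncoalesced) access is this read; the
       write below stays coalesced. The push scheme inverts the two, making
       the write the uncoalesced one, which costs more.                    */
    float f_east = f_src[west];
    f_dst[cell] = f_east;  /* the real kernel stores the post-collision value */
}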

Besides the above approaches, in [12] a new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.
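For concreteness (a hedged example, not the exact settings used in [8] or [16]), the register ceiling can be requested per compilation unit with an nvcc flag or per kernel with CUDA's __launch_bounds__ qualifier:

/* Per file:   nvcc -maxrregcount=40 lbm.cu                  */
/* Per kernel: the __launch_bounds__ qualifier shown below   */
__global__ void
__launch_bounds__(256 /* max threads per block */,
                  4   /* min resident blocks per SMX */)
capped_kernel(const float * __restrict__ src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];  /* placeholder body */
}

Capping too aggressively spills registers to the local memory, which is exactly the drawback noted above.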

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for the two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on the single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure for the LBM kernels and improving the available thread parallelism generated at the run time, we developed techniques for aggressively reducing the register uses for the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.2 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.
[2] J. Tölke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.
[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.
[4] M. Stürmer, J. Götz, G. Richter, A. Dörfler, and U. Rüde, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.
[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.
[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl.
[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org.
[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.
[9] J. Tölke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.
[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.
[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.
[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.
[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.
[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl für Informatik, Würzburg, Germany, 2003.
[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.
[17] P. R. Rinaldi, E. A. Dari, M. J. Vénere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.
[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.
[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.




(v) Figure 20 presents the overall performance of theSoA Pull Full Tiling implementation compared withthe SoA Pull Only With all our optimization tech-niques described in Sections 42 43 and 44 weobtained 28 overall performance improvementscompared with the previous approach

(vi) Table 3 compares the performance of four im-plementations (serial AoS SoA Pull Only andSoA Pull Full Tiling) with different domain sizesAs shown the peak performance of 121063 MLUPSis achieved by the SoA Pull Full Tiling with domainsize 2563 where the speedup of 136 is also achieved

(vii) Table 4 compares the performance of our work withthe previous work conducted by Mawson and Revell[12] Both implementations were conducted on thesame K20 GPU Our approach performs better than[12] from 14 to 19 Our approach incorporates

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210


Figure 21: Performance (MLUPS) on the GTX285 with different domain sizes (64³, 128³, 160³); bars for serial, AoS, SoA_Push_Only, SoA_Pull_Only, SoA_Pull_RR, SoA_Pull_BR, SoA_Pull_Full, and SoA_Pull_Full_Tiling.

(viii) In order to validate the effectiveness of our approach, we conducted further experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimized SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. Also, we obtained a 46-fold speedup compared with the serial implementation. The performance improvement and the speedup are, however, lower on the GTX285 than on the K20.


Table 5: Performance (MLUPS) on GTX285.

Domain size   Serial   AoS     SoA_Push_Only   SoA_Pull_Only   SoA_Pull_BR   SoA_Pull_RR   SoA_Pull_Full   SoA_Pull_Full_Tiling
64³           9.82     45.22   240.97          257.36          268.45        283.24        296.73          328.87
128³          7.42     49.85   242.12          259.04          270.58        285.68        299.79          335.77
160³          9.57     50.15   237.58          254.18          266.25        279.59        294.26          328.62
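To make the launch configuration behind the best tile sizes in item (iv) concrete, the following CUDA sketch (not the authors' code; the kernel name and body are placeholders) shows how a thread block can cover one $n_x \times n_y \times n_z$ tile, with each thread walking a column of $n_z$ cells as described in Section 4.2.1:

// Placeholder kernel: the real kernel fuses the streaming (pull) and
// collision phases as in Algorithm 4; only the tiling indexing is shown.
__global__ void lbm_tile_kernel(const float *src, float *dst,
                                int Nx, int Ny, int Nz, int nz)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;  // cell index in x
    int y  = blockIdx.y * blockDim.y + threadIdx.y;  // cell index in y
    int z0 = blockIdx.z * nz;                        // first z-plane of this tile
    for (int dz = 0; dz < nz; ++dz) {                // one column of nz cells per thread
        int z = z0 + dz;
        int cell = x + y * Nx + z * Nx * Ny;         // 1D lattice index, Equation (9)
        // ... gather 19 pdfs from src, collide, store to dst at cell ...
        (void)cell;
    }
}

void launch_tiled(const float *src, float *dst, int Nx, int Ny, int Nz)
{
    const int nx = 32, ny = 16, nz = Nz / 4;   // best-performing tile sizes (item (iv))
    dim3 block(nx, ny);                        // 32 x 16 = 512 threads per block
    dim3 grid(Nx / nx, Ny / ny, Nz / nz);      // one thread block per 3D tile
    lbm_tile_kernel<<<grid, block>>>(src, dst, Nx, Ny, Nz, nz);
}

For the 256³ domain this yields an 8 × 16 × 4 grid of 512-thread blocks, with each thread processing a column of 64 cells.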

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus, most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory: Tolke [9] used this approach and implemented the D2Q9 model; Habich et al. [8] followed the same approach for the D3Q19 model; Bailey et al. [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme is used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers. This lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] the shuffle instruction, a new feature of the Tesla K20 GPU, was applied to avoid the misalignment in the streaming phase; however, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag; however, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.
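For context, the register-capping approach discussed above is commonly expressed in CUDA in one of two ways; a brief illustrative sketch follows (the kernel name is a placeholder, and the cap of 40 registers mirrors the count our kernels reached in Section 4.3):

// Per-kernel cap via the __launch_bounds__() qualifier: promise at most
// 256 threads per block and ask for at least 8 resident blocks per SMX,
// which pressures the compiler to use fewer registers per thread.
__global__ void __launch_bounds__(256, 8) lbm_kernel(const float *src, float *dst)
{
    // ... kernel body ...
}

// Alternatively, a compilation-unit-wide cap via the compiler flag:
//   nvcc -maxrregcount=40 lbm.cu
// Either form can spill registers to (slow) local memory when the cap is
// set too low, which is the drawback noted above.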

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on the single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure for the LBM kernels and improving the available thread parallelism generated at the run time, we developed techniques for aggressively reducing the register uses for the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.2 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-fold speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Sturmer, J. Gotz, G. Richter, A. Dorfler, and U. Rude, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org/.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tolke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl fur Informatik, Wurzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 Benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Venere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.



(i) Figure 16 compares the average performance of theSoA with and without removing the branch diver-gences explained in Section 44 in the kernel codeRemoving the branch divergence improves the per-formance by 437 445 469 and 519 fordomain sizes 643 1283 1923 and 2563 respectively

(ii) Reducing the register uses described in Section 43improves the performance by 1207 1244 1198and 1258 as Figure 17 shows

Scientific Programming 13Pe

rform

ance

(MLU

PS)

820840860880900920

740760780800

SoA_Pull_OnlySoA_Pull_BR

Domain size643 1283 1923 2563

Figure 16 Performance (MLUPS) comparison of the SoA with andwithout branch removal for different domain sizes

750

800

850

900

950

1000

Perfo

rman

ce (M

LUPS

)

600

650

700

750

SoA_Pull_OnlySoA_Pull_RR

Domain size643 1283 1923 2563

Figure 17 Performance (MLUPS) comparison of the SoA with andwithout reducing register uses for different domain sizes

(iii) Figure 18 compares the performance of the SoA usingthe pull scheme with optimization techniques suchas the branch divergence removal and the registerusage reduction described in Sections 43 and 44(SoA Pull Full) and the SoA Pull Only The opti-mized SoA Pull implementation is better than theSoA Pull Only by 1644 1689 1668 and 1777for the domain sizes 643 1283 1923 and 2563respectively

(iv) Figure 19 shows the performance comparison ofthe SoA Pull Full and SoA Pull Full Tiling TheSoA Pull Full Tiling performance is better than theSoA Pull Full from 1178 to 136 The domainsize 1283 gives the best performance improvement of136 while the domain size 2563 gives the lowestimprovement of 1178 The experimental resultsshow that the tiling size for the best performance is119899119909 = 32 119899119910 = 16 and 119899119911 = 119873119911 divide 4

600

800

1000

1200

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_OnlySoA_Pull_Full

Domain size643 1283 1923 2563

Figure 18 Performance (MLUPS) comparison of the SoA with andwithout optimization techniques for different domain sizes

600

800

1000

1200

1400

0

200

400

Perfo

rman

ce (M

LUP)

SoA_Pull_Full_TilingSoA_Pull_Full

Domain size643 1283 1923 2563

Figure 19 Performance (MLUPS) comparison of the SoA Pull Fulland the SoA Pull Full Tiling with different domain sizes

(v) Figure 20 presents the overall performance of theSoA Pull Full Tiling implementation compared withthe SoA Pull Only With all our optimization tech-niques described in Sections 42 43 and 44 weobtained 28 overall performance improvementscompared with the previous approach

(vi) Table 3 compares the performance of four im-plementations (serial AoS SoA Pull Only andSoA Pull Full Tiling) with different domain sizesAs shown the peak performance of 121063 MLUPSis achieved by the SoA Pull Full Tiling with domainsize 2563 where the speedup of 136 is also achieved

(vii) Table 4 compares the performance of our work withthe previous work conducted by Mawson and Revell[12] Both implementations were conducted on thesame K20 GPU Our approach performs better than[12] from 14 to 19 Our approach incorporates

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210

more optimization techniques such as the tiling opti-mization with the data layout change the branchdivergence removal among others compared with[12]

SerialAoSSoA_Push_OnlySoA_Pull_Only

SoA_Pull_RRSoA_Pull_BRSoA_Pull_FullSoA_Pull_Full_Tiling

Domain size643 1283 1603

Perfo

rman

ce (M

LUPS

)

400

350

300

250

200

150

100

50

0

Figure 21 Performance (MLUPS) on GTX285 with differentdomain sizes

(viii) In order to validate the effectiveness of our approachwe conducted more experiments on the other GPUNvidia GTX285 Table 5 and Figure 21 show theaverage performance of our implementations withdomain sizes 643 1283 and 1603 (The grid sizeslarger than 1603 cannot be accommodated in thedevice memory of the GTX 285) As shown ouroptimization technique SoA Pull Full Tiling is bet-ter than the previous SoA Pull Only up to 2285Alsowe obtained 46-time speedup comparedwith theserial implementation The level of the performanceimprovement and the speedup are however lower onthe GTX 285 compared with the K20

Scientific Programming 15

Table 5 Performance (MLUPS) on GTX285

Domain sizes Serial AoS SoA Push Only SoA Pull Only SoA Pull BR SoA Pull RR SoA Pull Full SoA Pull Full Tiling643 982 4522 24097 25736 26845 28324 29673 328871283 742 4985 24212 25904 27058 28568 29979 335771603 957 5015 23758 25418 26625 27959 29426 32862

6 Previous Research

Previous parallelization approaches for the LBM algorithmfocused on two main issues how to efficiently organize thedata and how to avoid the misalignment in the streaming(propagation) phase of the LBM In the data organizationthe AoS and the SoA are two most commonly used schemesWhile AoS scheme is suitable for the CPU architecture SoAis a better scheme for the GPU architecture when the globalmemory access coalition technique is incorporated Thusmost implementations of the LBM on the GPU use the SoAas the main data organization

In order to avoid the misalignment in the streamingphase of the LBM there are two main approaches The firstproposed approach uses the shared memory Tolke in [9]used the approach and implemented the D2Q9 modelHabich et al [8] followed the same approach for the D3Q19model Bailey et al in [16] also used the shared memory toachieve 100 coalescence in the propagation phase for theD3Q13 model In the second approach the pull scheme wasused instead of the push scheme without using the sharedmemory

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers, which lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.
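The push/pull difference can be sketched for a single east-moving pdf over one row of cells; this is a minimal illustration under our own naming (the SoA layout is assumed, and E is a hypothetical direction index), not the actual kernels of [12] or of our implementation:

    #define E 1  /* hypothetical index of the east-moving pdf */

    __device__ float collide(float f) { return 0.9f * f; }  /* stand-in update */

    /* Push: aligned load at cell i, store shifted to i + 1; the stores
       straddle 128-byte segments, so the writes are uncoalesced. */
    __global__ void push_east(const float *src, float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + 1 < n)
            dst[E * n + (i + 1)] = collide(src[E * n + i]);
    }

    /* Pull: load shifted from i - 1, aligned store at cell i; the
       misalignment moves to the read side, which costs less. */
    __global__ void pull_east(const float *src, float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 1 && i < n)
            dst[E * n + i] = collide(src[E * n + (i - 1)]);
    }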

Besides the above approaches, in [12] a new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.
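For reference, CUDA offers two documented ways to cap the register use; the numbers below are illustrative only, not the limits chosen in [16] or in our kernels:

    /* 1) Compile-time flag applied to the whole file:
          nvcc -maxrregcount=40 lbm.cu */

    /* 2) Per-kernel __launch_bounds__(maxThreadsPerBlock,
          minBlocksPerMultiprocessor): the compiler derives a register
          budget that keeps at least 4 blocks of 256 threads resident
          per SMX, spilling to local memory if it cannot. */
    __global__ void
    __launch_bounds__(256, 4)
    lbm_kernel_sketch(const float *src, float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];  /* placeholder body */
    }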

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for the two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on the single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7 Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure for the LBM kernels and improving the available thread parallelism generated at the run time, we developed techniques for aggressively reducing the register uses for the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.2 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was also supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Sturmer, J. Gotz, G. Richter, A. Dorfler, and U. Rude, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org/.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tolke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl fur Informatik, Wurzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Venere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.



Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

Scientific Programming 5

int i;
for (i = 0; i < timeSteps; i++) {
    collision_kernel<<<GRID, BLOCK>>>(source_grid, temp_grid, xdim, ydim,
                                      zdim, cell_size, grid_size);
    cudaThreadSynchronize();
    streaming_kernel<<<GRID, BLOCK>>>(temp_grid, dest_grid, xdim, ydim,
                                      zdim, cell_size, grid_size);
    cudaThreadSynchronize();
    swap_grid(source_grid, dest_grid);
}

Algorithm 3: Two separate CUDA kernels for the different phases of the LBM.

large number of GPU cores and its sophisticated memory hierarchy. The flexible architecture and the programming environments have led to a number of innovative performance improvements in many application areas, and many more are still to come.

4. Optimizing Cache Locality and Thread Parallelism

In this section we first introduce some preliminary steps we employed in our parallelization and optimization of the LBM algorithm in Section 4.1. They are mostly borrowed from the previous research, such as combining the collision phase and the streaming phase, a GPU architecture friendly data organization scheme (SoA scheme), an efficient data placement in the GPU memory hierarchy, and using the pull scheme for avoiding and minimizing the uncoalesced memory accesses. Then we describe our key optimization techniques for improving the cache locality and the thread parallelism, such as the tiling with the data layout change and the aggressive reduction of the register uses per thread, in Sections 4.2 and 4.3. Optimization techniques for removing the branch divergence are presented in Section 4.4. Our key optimization techniques presented in this section have been improved from our earlier work in [13].

4.1. Preliminaries

4.1.1. Combination of Collision Phase and Streaming Phase. As shown in the description of the LBM algorithm in Algorithm 2, the LBM consists of two main computing kernels: collision_kernel for the collision phase and streaming_kernel for the streaming phase. In collision_kernel, threads load the particle distribution function from the source grid (source_grid) and then calculate the velocity, the density, and the collision product. The post-collision values are stored to a temporary grid (temp_grid). In streaming_kernel, the post-collision values from temp_grid are loaded and updated to the appropriate neighbor grid cells in the destination grid (dest_grid). At the end of each time step, source_grid and dest_grid are swapped for the next time step. This implementation (see

int i;
for (i = 0; i < timeSteps; i++) {
    lbm_kernel<<<GRID, BLOCK>>>(source_grid, dest_grid, xdim, ydim,
                                zdim, cell_size, grid_size);
    cudaThreadSynchronize();
    swap_grid(source_grid, dest_grid);
}

Algorithm 4: Single CUDA kernel after combining the two phases of the LBM.

Algorithm 3) needs extra loads/stores to/from temp_grid, which is stored in the global memory and affects the global memory bandwidth [2]. In addition, some extra cost is incurred by the global synchronization between the two kernels (cudaThreadSynchronize), which affects the overall performance. In order to reduce these overheads, we can combine collision_kernel and streaming_kernel into one kernel, lbm_kernel, where the collision product is streamed to the neighbor grid cells directly after calculation (see Algorithm 4). Compared with Algorithm 3, storing to and loading from temp_grid are removed and the global synchronization cost is reduced.

4.1.2. Data Organization. In order to represent the 3-dimensional grid of cells, we use a 1-dimensional array which has Nx × Ny × Nz × Q elements, where Nx, Ny, and Nz are the width, height, and depth of the grid and Q is the number of directions of each cell [2, 14]. For example, if the model is D3Q19 with Nx = 16, Ny = 16, and Nz = 16, we have a 1D array of 16 × 16 × 16 × 19 = 77,824 elements. We use two separate arrays for storing the source grid and the destination grid.

There are two common data organization schemes for storing the arrays:

(i) Array of structures (AoS): the grid cells are arranged in a 1D array; the 19 distributions of each cell occupy 19 consecutive elements of the 1D array (Figure 6).

Figure 6: AoS scheme. The 19 distributions (C, N, S, ..., WT, WB) of cell 0 are stored consecutively, followed by those of cell 1, up to cell (16 × 16 × 16 − 1); one thread processes one cell.

Figure 7: SoA scheme. The C distributions of all cells (cell 0 up to cell (16 × 16 × 16) − 1) are stored consecutively, then the N distributions, and so on up to the WB distributions; consecutive threads access consecutive elements.

(ii) Structure of arrays (SoA): the values of one distribution of all cells are arranged consecutively in memory (Figure 7). This scheme is more suitable for the GPU architecture, as we will show in the experimental results (Section 5). The indexing difference between the two schemes is sketched below.
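The following minimal sketch contrasts the two addressing schemes for a D3Q19 grid stored in a 1D array; the helper names (aos_index, soa_index) and the grid_size parameter are illustrative assumptions, not code from this paper.

    #define Q 19  // number of distributions (directions) per cell in D3Q19

    // AoS: the 19 distributions of one cell sit next to each other, so
    // consecutive threads (consecutive cells) touch addresses Q elements apart.
    __host__ __device__ inline int aos_index(int cell, int dir)
    {
        return cell * Q + dir;
    }

    // SoA: one distribution of all cells sits consecutively, so consecutive
    // threads touch consecutive addresses and the loads/stores can coalesce.
    __host__ __device__ inline int soa_index(int cell, int dir, int grid_size)
    {
        return dir * grid_size + cell;  // grid_size = Nx * Ny * Nz
    }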

4.1.3. Data Placement. In order to efficiently utilize the memory hierarchy of the GPU, the placement of the major data structures of the LBM is crucial [5]. In our implementation we use the following arrays: src_grid, dst_grid, types_arr, lc_arr, and nb_arr. We use the following data placements for these arrays.

(i) src_grid and dst_grid are used to store the input grid and the result grid. They are swapped at the end of each time step by exchanging their pointers, instead of explicitly storing to and loading from the memory through a temporary array. Since src_grid and dst_grid are very large arrays with a lot of data stores and loads, we place them in the global memory.

(ii) In the types_arr array, the types of the grid cells are stored. We use the Lid Driven Cavity (LDC) as the test case in this paper. The LDC consists of a cube filled with the fluid; one side of the cube serves as the acceleration plane by sliding constantly. The acceleration is implemented by assigning the cells in the acceleration area a constant velocity. This requires three types of cells: regular fluid, acceleration cells, or boundary. Thus we also need the 1D array types_arr in order to store the type of each cell in the grid. The size of this array is Nx × Ny × Nz elements; for example, if the model is D3Q19 with Nx = 16, Ny = 16, and Nz = 16, the size of the array is 16 × 16 × 16 = 4096 elements. Thus types_arr is also a large array, and it contains constant values which are not modified throughout the execution of the program. For these reasons, the texture memory is the right place for this array.

(iii) lc_arr and nb_arr are used to store the base indices for accesses to the 19 directions of the current cell and the neighbor cells, respectively. There are 19 indices corresponding to the 19 directions of the D3Q19 model. These indices are calculated at the start of the program and used till the end of the program execution; thus we use the constant memory to store them. Standing at any cell, we use the following formulas to define the position in the 1D array of any direction out of the 19 cell directions: curr_dir_pos_in_arr = cell_pos + lc_arr[direction] (for the current cell) and nb_dir_pos_in_arr = nb_pos + nb_arr[direction] (for the neighbor cells). A sketch of these placements follows.
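The sketch below illustrates the placements of items (ii) and (iii), using the legacy CUDA texture-reference API that was current for the Tesla K20 generation; the exact declarations and host-side setup (shown in comments) are our assumptions, not the paper's code.

    __constant__ int lc_arr[19];  // base offsets: 19 directions of the current cell
    __constant__ int nb_arr[19];  // base offsets: 19 directions of the neighbor cells

    // Cell types (fluid/acceleration/boundary), read through the texture path.
    texture<unsigned char, 1, cudaReadModeElementType> tex_types;

    // Host-side setup (error checking omitted):
    //   cudaMemcpyToSymbol(lc_arr, host_lc, 19 * sizeof(int));
    //   cudaMemcpyToSymbol(nb_arr, host_nb, 19 * sizeof(int));
    //   cudaBindTexture(0, tex_types, dev_types_arr, Nx * Ny * Nz);

    __device__ inline int curr_dir_pos_in_arr(int cell_pos, int direction)
    {
        return cell_pos + lc_arr[direction];  // one direction of the current cell
    }

    __device__ inline int nb_dir_pos_in_arr(int nb_pos, int direction)
    {
        return nb_pos + nb_arr[direction];    // one direction of a neighbor cell
    }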

4.1.4. Using Pull Scheme to Reduce Costs for Uncoalesced Accesses. Coalescing the global memory accesses can significantly reduce the memory overheads on the GPU. Multiple global memory loads whose addresses fall within a 128-byte range are combined into one request and sent to the memory. This saves the memory bandwidth a lot and improves the performance. In order to reduce the costs of the uncoalesced accesses, we use the pull scheme [12]. Choosing the pull scheme comes from the observation that the cost of an uncoalesced read is smaller than the cost of an uncoalesced write.

Algorithm 5 shows the LBM algorithm using the push scheme. In the first step, the pdfs are copied directly from the current cell. These pdfs are used to calculate the pdfs at the new time step (collision phase). The new pdfs are then streamed to the adjacent cells (streaming phase).

(1)  __global__ void soa_push_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2)  {
(3)      // Gather 19 pdfs from the current cell
(4)
(5)      // Apply boundary conditions
(6)
(7)      // Calculate the mass density ρ and the velocity u
(8)
(9)      // Calculate the local equilibrium distribution functions f(eq) using ρ and u
(10)
(11)     // Calculate the pdfs at the new time step
(12)
(13)     // Stream 19 pdfs to the adjacent cells
(14) }

Algorithm 5: Kernel using the push scheme.

(1)  __global__ void soa_pull_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2)  {
(3)      // Stream 19 pdfs from the adjacent cells to the current cell
(4)
(5)      // Apply boundary conditions
(6)
(7)      // Calculate the mass density ρ and the velocity u
(8)
(9)      // Calculate the local equilibrium distribution functions f(eq) using ρ and u
(10)
(11)
(12)     // Calculate the pdfs at the new time step
(13)
(14)     // Save the 19 values of the pdfs to the current cell
(15) }

Algorithm 6: Kernel using the pull scheme.

At the streaming phase, the distribution values are updated to the neighbors after they are calculated. All distribution values which do not move to the east or west direction (x-direction values equal to 0) can be updated to the neighbors (written to the device memory) directly, without any misalignment. However, the other distribution values (x-direction values equal to +1 or −1) need to be considered carefully because of their misaligned update positions. Their update positions are shifted to memory locations that do not belong to the 128-byte segment, while the thread indexes are not shifted correspondingly. So the misaligned accesses occur, and the performance can degrade significantly.

If we use the pull scheme, on the other hand, the order of the collision phase and the streaming phase in the LBM kernel is reversed (see Algorithm 6). In the first step of the pull scheme, the pdfs from the adjacent cells are gathered to the current cell (streaming phase) (Lines 3–5). Next, these pdfs are used to calculate the pdfs at the new time step, and these new pdfs are then stored to the current cell directly (collision phase). Thus, in the pull scheme the uncoalesced accesses occur when the data is read from the device memory, whereas they occur when the data is written in the push scheme. As a result, the cost of the uncoalesced accesses is smaller with the pull scheme. The sketch below contrasts the two access patterns for one shifted direction.
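As an illustration, consider the east-moving distribution, whose x-offset is +1. This is a minimal sketch, not the paper's kernel code; E (the index of the east direction), idx (the current cell), and the helper names are assumptions.

    // Push: the post-collision value is WRITTEN to the east neighbor. The
    // store address is shifted by +1 relative to the thread index, so a
    // WARP's stores straddle the 128-byte segment: an uncoalesced write.
    __device__ void push_store_east(float *dest_grid, int grid_size,
                                    int E, int idx, float f_new)
    {
        dest_grid[E * grid_size + (idx + 1)] = f_new;
    }

    // Pull: the same value is instead READ from the west neighbor, and the
    // store goes to the thread's own cell. Only the cheaper uncoalesced
    // read remains; the write stays aligned and coalesced.
    __device__ float pull_load_east(const float *source_grid, int grid_size,
                                    int E, int idx)
    {
        return source_grid[E * grid_size + (idx - 1)];
    }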

4.2. Tiling Optimization with Data Layout Change. In the D3Q19 model of the LBM, as the computations for the streaming and the collision phases are conducted for a certain cell, 19 distribution values which belong to the 19 surrounding cells are accessed. Figure 8 shows the data accesses to the 19 cells when

Figure 8: Data accesses for the orange cell in conducting computations for the streaming and collision phases (the 19 accessed cells lie on the three Nx × Ny planes Pi−1, Pi, and Pi+1).

a thread performs the computations for the orange colored cell. The 19 cells (18 directions (green cells) + the current cell in the center (orange cell)) are distributed on three different planes. Let Pi be the plane containing the current computing (orange) cell, and let Pi−1 and Pi+1 be the lower and upper planes, respectively. The Pi plane contains 9 cells; the Pi−1 and Pi+1 planes contain 5 cells each. When the computations for the cell, for example, (x, y, z) = (1, 1, 1), are performed, the following cells are accessed:

(i) P0 plane: (0,1,0), (1,0,0), (1,1,0), (1,2,0), (2,1,0).

(ii) P1 plane: (0,0,1), (0,1,1), (0,2,1), (1,0,1), (1,1,1), (1,2,1), (2,0,1), (2,1,1), (2,2,1).

(iii) P2 plane: (0,1,2), (1,0,2), (1,1,2), (1,2,2), (2,1,2).

The 9 accesses for the P1 plane are divided into three groups: {(0,0,1), (0,1,1), (0,2,1)}, {(1,0,1), (1,1,1), (1,2,1)}, and {(2,0,1), (2,1,1), (2,2,1)}. Each group accesses consecutive memory locations belonging to the same row. Accesses of different groups are separated apart and lead to uncoalesced accesses on the GPU when Nx is sufficiently large. In each of the P0 and P2 planes, there are three groups of accesses each. Here, too, the accesses of the same group touch consecutive memory locations while accesses of different groups are separated apart in the memory, which leads to uncoalesced accesses as well. Accesses to the data elements in the different planes (P0, P1, and P2) are separated even further apart and also lead to uncoalesced accesses when Ny is sufficiently large.

As the computations proceed, three rows in the y-dimension of the P0, P1, P2 planes will be accessed sequentially, for x = 0 ~ Nx−1, y = 0, 1, 2, followed by x = 0 ~ Nx−1, y = 1, 2, 3, ..., up to x = 0 ~ Nx−1, y = Ny−3, Ny−2, Ny−1. When the complete P0, P1, P2 planes have been swept, similar data accesses continue for the P1, P2, and P3 planes, and so on. Therefore there are a lot of data reuses in the x-, y-, and z-dimensions. As explained in Section 4.1.2, the 3D lattice grid is stored in a 1D array. The 19 cells for the computations belonging to the same plane are stored ±1 or ±Nx ± 1 cells away; the cells in different planes are stored ±Nx × Ny ± Nx ± 1 cells away. The data reuse distance along the x-dimension is short: +1 or +2 loop iterations apart. The data reuse distance along the y- and z-dimensions is ±Nx ± 1 or ±Nx × Ny ± Nx ± 1 iterations apart. If we can make the data reuse occur sooner by reducing the reuse distances, for example, using the tiling optimization, we can greatly improve the cache hit ratio. Furthermore, this reduces the overheads of the uncoalesced accesses, because many global memory accesses are removed by the cache hits. Therefore we tile the 3D lattice grid into smaller 3D blocks. We also change the data layout in accordance with the data access patterns of the tiled code, in order to store the data elements of the different groups closer in the memory; thus many uncoalesced memory accesses are removed because the accessed elements fall within a 128-byte boundary. In Sections 4.2.1 and 4.2.2 we describe our tiling and data layout change optimizations.

4.2.1. Tiling. Let us assume the following:

(i) Nx, Ny, and Nz are the sizes of the grid in the x-, y-, and z-dimension.

(ii) nx, ny, and nz are the sizes of the 3D block in the x-, y-, and z-dimension.

(iii) An xy-plane is a subplane which is composed of (nx × ny) cells.

We tile the grid into small 3D blocks with the tile sizes of nx, ny, and nz (yellow block in Figure 9(a)), where

    nx = Nx ÷ xc,
    ny = Ny ÷ yc,
    nz = Nz ÷ zc,                             (8)
    xc, yc, zc ∈ {1, 2, 3, ...}.

We let each CUDA thread block process one 3D tiled block. Thus, nz xy-planes need to be loaded for each thread block. In each xy-plane, each thread of the thread block executes the computations for one grid cell; thus each thread deals with a column containing nz cells (the red column in Figure 9(b)). If zc = 1, each thread processes Nz cells, and if zc = Nz, each thread processes only one cell. The tile size can be adjusted by changing the constants xc, yc, and zc; these constants need to be selected carefully to optimize the performance. With the tiling, the number of created threads is reduced by zc times. The launch configuration below sketches this mapping.
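The following is a minimal sketch of the tiled kernel launch and the per-thread column loop; the kernel body and parameter list are our assumptions for illustration, not the paper's implementation.

    // Each thread block covers one nx*ny*nz tile; each thread walks one
    // column of nz cells along the z-dimension.
    __global__ void lbm_tiled_kernel(const float *src, float *dst,
                                     int Nx, int Ny, int nz)
    {
        int x_id = blockIdx.x * blockDim.x + threadIdx.x;
        int y_id = blockIdx.y * blockDim.y + threadIdx.y;
        int z0   = blockIdx.z * nz;                  // first plane of this tile
        for (int k = 0; k < nz; k++) {
            int cell = x_id + y_id * Nx + (z0 + k) * Nx * Ny;  // 1D cell index
            // ... stream + collide for 'cell' (omitted) ...
            (void)cell; (void)src; (void)dst;
        }
    }

    void launch_tiled(const float *src, float *dst,
                      int Nx, int Ny, int Nz, int nx, int ny, int nz)
    {
        dim3 block(nx, ny, 1);                 // one tile's xy-plane of threads
        dim3 grid(Nx / nx, Ny / ny, Nz / nz);  // one thread block per 3D tile
        lbm_tiled_kernel<<<grid, block>>>(src, dst, Nx, Ny, nz);
    }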

4.2.2. Data Layout Change. In order to further improve the benefits of the tiling and reduce the overheads associated with the uncoalesced accesses, we propose to change the data layout. Figure 10 shows one xy-plane of the grid with and without the layout change. With the original layout (Figure 10(a)), the data is stored in the row-major fashion; thus the entire first row is stored, followed by the second

Figure 9: Tiling optimization for the LBM. (a) Grid divided into 3D blocks. (b) Block containing subplanes; the red column contains all cells one thread processes.

Figure 10: Different data layouts for blocks. (a) Original data layout. (b) Proposed data layout.

row, and so on. In the proposed new layout, the cells in the tiled first row of Block 0 are stored first. Then the second tiled row of Block 0 is stored, instead of the first row of Block 1 (Figure 10(b)). With the layout change, the data cells accessed in consecutive iterations of the tiled code are placed sequentially. This places the data elements of the different groups closer; thus it increases the possibility that the memory accesses to the different groups are coalesced, if the tiling factor and the memory layout factor are adjusted appropriately. This can further improve the performance beyond the tiling.

The data layout can be transformed using the following formula:

    index_new = x_id + y_id × Nx + z_id × Nx × Ny,   (9)

where x_id and y_id are the cell indexes in the x- and y-dimension on the plane of the grid and z_id is a value in the range of 0 to nz − 1. x_id and y_id can be calculated as follows:

    x_id = (block index in x-dimension)
         × (number of threads in a thread block in x-dimension)
         + (thread index in the thread block in x-dimension),
                                                     (10)
    y_id = (block index in y-dimension)
         × (number of threads in a thread block in y-dimension)
         + (thread index in the thread block in y-dimension).

In our implementation we use the changed input data layout, stored offline before the program starts (the original input is changed to the new layout and stored to the input file). The transformed input file is then used while conducting the experiments. The sketch below shows formulas (9) and (10) as device code.
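A minimal device-side sketch of formulas (9) and (10); the helper name cell_index_new is hypothetical.

    // index_new of formula (9), with x_id and y_id derived from the CUDA
    // block/thread indexes as in formula (10).
    __device__ inline int cell_index_new(int Nx, int Ny, int z_id)
    {
        int x_id = blockIdx.x * blockDim.x + threadIdx.x;  // formula (10)
        int y_id = blockIdx.y * blockDim.y + threadIdx.y;
        return x_id + y_id * Nx + z_id * Nx * Ny;          // formula (9)
    }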

4.3. Reduction of Register Uses per Thread. The D3Q19 model is more precise than models with smaller numbers of distributions, such as D2Q9 or D3Q13, and thus uses more variables. This leads to more register uses in the main computation kernels. On the GPU, the register use of the threads is one of the factors limiting the number of active WARPs on a streaming multiprocessor (SMX). Higher register use can lead to lower parallelism and occupancy (see Figure 11 for an example), which results in overall performance degradation. The Nvidia compiler provides ways to limit the register uses to a certain limit, such as the -maxrregcount flag or the __launch_bounds__() qualifier [5]. The -maxrregcount switch sets a maximum on the number of registers used by each thread. These can help increase the occupancy by reducing the register uses per thread. However, our experiments show that the overall performance goes down, because they lead to a lot of register spills/refills to/from the local memory. The increased

Figure 11: Sharing 2048 registers among (a) a larger number of threads with smaller register uses (eight blocks, 64 threads per block, and 4 registers per thread) versus (b) a smaller number of threads with larger register uses (eight blocks, 32 threads per block, and 8 registers per thread).

memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.

In order to reduce the register uses per thread while avoiding the register spills/refills to/from the local memory, we used the following techniques.

(i) Calculate the indexing addresses of the distributions manually. Each cell has 19 distributions; thus we would need 38 variables for storing indexes (19 distributions × 2 memory accesses, for load and store) in the D3Q19 model. However, each index variable is used only once in each execution phase. Thus we can use only two variables instead of 38: one for calculating the loading indexes and one for calculating the storing indexes.

(ii) Use the shared memory for the variables commonly used among threads, for example, to store the base addresses.

(iii) Cast multiple small-size variables into one large variable; for example, we combined 4 char-type variables into one integer variable.

(iv) For simple operations which can be easily calculated, we do not store the results to memory variables. Instead, we recompute them later.

(v) We use only one array to store the distributions, instead of using 19 separate arrays.

(vi) In the original LBM code, a lot of variables are declared to store the FP computation results, which increases the register uses. In order to reduce the register uses, we attempt to reuse variables whose lifetimes ended earlier in the preceding code. This may lower the instruction-level parallelism of the kernel; however, it helps increase the thread-level parallelism, as more threads can be active at the same time with the reduced register uses per thread.

(vii) In the original LBM code, there are some complicated floating-point- (FP-) intensive computations used in a number of nearby statements. We aggressively extract these computations as common subexpressions. This frees the registers involved in the common subexpressions, thus reducing the register uses. It also reduces the dynamic instruction count.

Applying the above techniques in our implementation, the number of registers in each kernel is greatly reduced, from 70 registers to 40 registers. This leads to higher occupancy of the SMXs and significant performance improvements. The sketch below illustrates techniques (i) and (iii).
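A minimal sketch of techniques (i) and (iii) above; the variable names, and the reuse of the lc_arr constant array from Section 4.1.3, are our illustrative assumptions, not the paper's code.

    __device__ void register_reduction_example(const float *src_grid, int cell_pos,
                                               unsigned char c0, unsigned char c1,
                                               unsigned char c2, unsigned char c3)
    {
        // (iii) Pack four char-sized values into one 32-bit variable, so they
        // occupy one register instead of four.
        unsigned int packed = (unsigned int)c0
                            | ((unsigned int)c1 << 8)
                            | ((unsigned int)c2 << 16)
                            | ((unsigned int)c3 << 24);

        // (i) Reuse a single index variable for the successive loads instead
        // of keeping 19 live index registers.
        int idx;
        idx = cell_pos + lc_arr[0];  float f0 = src_grid[idx];
        idx = cell_pos + lc_arr[1];  float f1 = src_grid[idx];
        // ... the remaining 17 directions follow the same pattern ...
        (void)packed; (void)f0; (void)f1;  // silence unused warnings in this sketch
    }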

4.4. Removing Branch Divergence. Flow control instructions on the GPU cause the threads of the same WARP to diverge; the resulting different execution paths are serialized. This can significantly affect the performance of the application program on the GPU; thus branch divergence should be avoided as much as possible. In the LBM code there are two main sources of branch divergence:

(i) solving the streaming at the boundary positions;

(ii) defining actions for the corresponding cell types.

Figure 12: Illustration of the lattice with a "ghost layer" (ghost layer, boundary cells, and the other cells).

if (cell_type == FLUID)
    x = a;
else
    x = b;

Algorithm 7: Skeleton of the IF-statement used in the LBM kernel.

is_fluid = (cell_type == FLUID);
x = a * is_fluid + b * (!is_fluid);

Algorithm 8: Code with the IF-statement removed.

In order to avoid using IF-statements while streaming at the boundary positions, a "ghost layer" is attached in the y- and z-dimensions. If Nx, Ny, and Nz are the width, height, and depth of the original grid, then NNx = Nx, NNy = Ny + 1, and NNz = Nz + 1 are the new width, height, and depth of the grid with the ghost layer (Figure 12). With the ghost layer, we can treat the computations at the boundary positions as normal ones, without worrying about running out of the index bound.

As explained in Section 4.1.2, the cells of the grid belong to three types: the regular fluid, the acceleration cells, or the boundary. The LBM kernel contains conditions to define the actions for each type of cell. The boundary cell type can be covered in the above-mentioned way using the ghost layer. This still leaves the other two different conditions in the same half-WARP of the GPU. Thus, in order to remove the IF-conditions, we combine the conditions into computational statements. Using this technique, the IF-statement in Algorithm 7 is rewritten as in Algorithm 8; the sketch below shows the same idea as a device function.
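A minimal sketch of the predication idea of Algorithm 8; the function name and the FLUID encoding are illustrative assumptions.

    #define FLUID 0  // illustrative encoding of the fluid cell type

    // Both operands are computed and the result is selected arithmetically,
    // so all threads of the WARP follow one path: no divergent branch.
    __device__ inline float blend_by_type(unsigned char cell_type, float a, float b)
    {
        int is_fluid = (cell_type == FLUID);  // 1 for fluid cells, else 0
        return a * is_fluid + b * (1 - is_fluid);
    }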

5. Experimental Results

In this section we first describe the experimental setup. Then we show the performance results with analyses.

Figure 13: Performance (MLUPS) comparison of the Serial and the AoS implementations with different domain sizes (64³, 128³, 192³, and 256³).

5.1. Experimental Setup. We implemented the LBM in the following five ways:

(i) Serial implementation using a single CPU core (Serial), using the source code from SPEC CPU 2006 470.lbm [15] to make sure it is reasonably optimized.

(ii) Parallel implementation on a GPU using the AoS data scheme (AoS).

(iii) Parallel implementation using the SoA data scheme and the push scheme (SoA_Push_Only).

(iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA_Pull_Only).

(v) SoA using the pull scheme with our various optimizations, including the tiling with the data layout change (SoA_Pull_*).

We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. The domain grid sizes are scaled in the range of 64³, 128³, 192³, and 256³. The numbers of time steps are 1000, 5000, and 10000.

In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used, which is calculated as follows:

    MLUPS = (Nx × Ny × Nz × Nts) / (10⁶ × T),   (11)

where Nx, Ny, and Nz are the domain sizes in the x-, y-, and z-dimension, Nts is the number of time steps used, and T is the run time of the simulation.
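As an illustrative check (our arithmetic, using the peak number reported in Section 5.3): a 256³ grid advanced for 10,000 time steps performs 256³ × 10⁴ ≈ 1.678 × 10¹¹ lattice updates, so a run time of about T ≈ 138.6 s yields MLUPS ≈ 1.678 × 10¹¹ / (10⁶ × 138.6) ≈ 1210, which matches the 1210.63 MLUPS peak within rounding.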

Our experiments were conducted on a system incorporating an Intel multicore processor (6-core 2.0 GHz Intel Xeon E5-2650) with a 20 MB level-3 cache and an Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB of device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, the Nvidia GTX 285.

5.2. Results Using Previous Approaches. The average performances of the Serial and the AoS are shown in Figure 13.

Table 2: Summary of experiments.

Experiment              Description
Serial                  Serial implementation on a single CPU core
Parallel
  AoS                   AoS scheme
  SoA_Push_Only         SoA scheme + push data scheme
  SoA_Pull_Only         SoA scheme + pull data scheme
  SoA_Pull_BR           SoA scheme + pull data scheme + branch divergence removal
  SoA_Pull_RR           SoA scheme + pull data scheme + register usage reduction
  SoA_Pull_Full         SoA scheme + pull data scheme + branch divergence removal + register usage reduction
  SoA_Pull_Full_Tiling  SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA with different domain sizes.

With the domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the Serial are 9.82, 7.42, 9.57, and 8.93 MLUPS, and those for the AoS are 112.06, 78.69, 76.86, and 74.99 MLUPS, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme: the SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³, respectively. Note that in this experiment we applied only the SoA scheme, without any other optimization techniques.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either implementation. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for the domain sizes 64³, 128³, 192³, and 256³, respectively. Thus the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively. The number of global memory transactions

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

observed shows that the total numbers of transactions (loads and stores) of the pull and push schemes are nearly equal. However, the number of store transactions of the pull scheme is 5.62 times smaller than that of the push scheme. This leads to the performance improvement of the pull scheme over the push scheme.

5.3. Results Using Our Optimization Techniques. In this subsection we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull implementation.

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences (explained in Section 4.4) in the kernel code. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses, described in Section 4.3, improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing register uses for different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with the optimization techniques of branch divergence removal and register usage reduction described in Sections 4.3 and 4.4 (SoA_Pull_Full) against the SoA_Pull_Only. The optimized SoA_Pull implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performance is better than the SoA_Pull_Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tile size for the best performance is nx = 32, ny = 16, and nz = Nz ÷ 4.

Figure 18: Performance (MLUPS) comparison of the SoA with and without optimization techniques for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of the four implementations (Serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA_Pull_Full_Tiling with the domain size 256³, where the speedup of 136 over the Serial is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were conducted on the same K20 GPU. Our approach performs better than [12] by 14% to 19%. Our approach incorporates more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].

Table 3: Performance (MLUPS) comparisons of four implementations.

Domain size  Time steps  Serial  AoS     SoA_Pull_Only  SoA_Pull_Full_Tiling
64³          1000        9.89    111.73  759.52         1034
64³          5000        9.75    112.24  814.01         1115.63
64³          10000       9.82    111.73  818.36         1129.32
64³          Avg. Perf.  9.82    112.41  797.30         1092.99
128³         1000        7.64    78.58   798.21         1115.2
128³         5000        7.65    78.74   855.55         1189.15
128³         10000       6.98    78.74   861.42         1199.33
128³         Avg. Perf.  7.42    78.69   838.39         1167.69
192³         1000        9.47    76.79   811.35         1114.96
192³         5000        9.69    76.89   866.76         1185.95
192³         10000       9.56    76.91   871.39         1205.48
192³         Avg. Perf.  9.57    76.86   849.83         1168.8
256³         1000        8.99    74.74   787.76         1113.77
256³         5000        8.91    75.09   873.8          1182.33
256³         10000       8.9     75.14   883.56         1210.63
256³         Avg. Perf.  8.93    74.99   848.37         1168.91

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with the previous work [12].

Domain size  Mawson and Revell (2014)  Our work
64³          914                       1129
128³         990                       1199
192³         1036                      1205
256³         1020                      1210


Figure 21: Performance (MLUPS) on the GTX 285 with different domain sizes (64³, 128³, and 160³) for the Serial, AoS, SoA_Push_Only, SoA_Pull_Only, SoA_Pull_BR, SoA_Pull_RR, SoA_Pull_Full, and SoA_Pull_Full_Tiling implementations.

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on the other GPU, the Nvidia GTX 285. Table 5 and Figure 21 show the average performance of our implementations with the domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX 285.) As shown, our optimization technique SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. Also, we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX 285 than on the K20.

Table 5: Performance (MLUPS) on the GTX 285.

Domain size  Serial  AoS    SoA_Push_Only  SoA_Pull_Only  SoA_Pull_BR  SoA_Pull_RR  SoA_Pull_Full  SoA_Pull_Full_Tiling
64³          9.82    45.22  240.97         257.36         268.45       283.24       296.73         328.87
128³         7.42    49.85  242.12         259.04         270.58       285.68       299.79         335.77
160³         9.57    50.15  237.58         254.18         266.25       279.59       294.26         328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalition technique is incorporated. Thus most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first proposed approach uses the shared memory. Tolke in [9] used this approach and implemented the D2Q9 model. Habich et al. [8] followed the same approach for the D3Q19 model. Bailey et al. in [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme was used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers. This lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] a new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed a 20% improvement in

maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on a single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure of the LBM kernels and improving the available thread parallelism generated at run time, we developed techniques for aggressively reducing the register uses of the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.0 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Sturmer, J. Gotz, G. Richter, A. Dorfler, and U. Rude, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org/.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tolke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl fur Informatik, Wurzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Venere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

6 Scientific Programming

C N S WT WB C N S WT WB C N S WT WB

Cell 0 Cell 1

Thread 0 Thread 1

middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot

middot middot middot

middot middot middot

Cell (16 times 16 times 16 minus 1)

Figure 6 AoS scheme

C C C C C N N N N N WB WB WB WB WB

Cell 0 Cell 0 Cell 0Cell 1 Cell 1 Cell 1

Thread 0 Thread 0 Thread 0Thread 1 Thread 1 Thread 1

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot middot

middot middot middot

Cell(16 times 16 times 16) minus 1 Cell

(16 times 16 times 16) minus 1Cell

(16 times 16 times 16) minus 1

Figure 7 SoA scheme

(ii) Structure of arrays (SoA) the value of one distribu-tion of all cells is arranged consecutively in memory(Figure 7) This scheme is more suitable for the GPUarchitecture as we will show in the experimentalresults (Section 5)

413 Data Placement In order to efficiently utilize themem-ory hierarchy of the GPU the placement of the major datastructures of the LBM is crucial [5] In our implementationwe use the following arrays 119904119903119888 119892119903119894119889 119889119904119905 119892r119894119889 119905119910119901119890119904 119886119903119903119897119888 119886119903119903 and 119899119887 119886119903119903 We use the following data placements forthese arrays

(i) 119904119903119888 119892119903119894119889 and 119889119904119905 119892119903119894119889 are used to store the inputgrid and the result grid They are swapped at theend of each time step by exchanging their pointersinstead of explicit storing to and loading from thememory through a temporary array Since 119904119903119888 119892119903119894119889and 119889119904119905 119892119903119894119889 are very large size arrays with a lot ofdata stores and loads we place them in the globalmemory

(ii) In 119905119910119901119890119904 119886119903119903 array the types of the grid cells arestored We use the Lid Driven Cavity (LDC) as thetest case in this paper The LDC consists of a cubefilled with the fluid One side of the cube servesas the acceleration plane by sliding constantly Theacceleration is implemented by assigning the cellsin the acceleration area at a constant velocity Thischange requires three types of cells regular fluidacceleration cells or boundaryThus we also need 1Darray 119905119910119901119890119904 119886119903119903 in order to store the types of each cellin the grid The size of this array is 119873119909 times 119873119910 times 119873119911elements For example if the model is D3Q19 with119873119909 = 16119873119910 = 16 and119873119911 = 16 the size of the arrayis 16 times 16 times 16 = 4096 elements Thus 119905119910119901119890119904 119886119903119903 is

a large array also and contains constant values Thusthey are notmodified throughout the execution of theprogram For these reasons the texturememory is theright place for this array

(iii) 119897119888 119886119903119903 and 119899119887 119886119903119903 are used to store the base indicesfor accesses to 19 directions of the current cell andthe neighbor cells respectively There are 19 indicescorresponding to 19 directions ofD3Q19modelTheseindices are calculated at the start of the programand used till the end of the program executionThus we use the constant memory to store them Asstanding at any cell we use the following formula todefine the position in the 1D array of any directionout of 19 cell directions 119888119906119903119903 119889119894119903 119901119900119904 119894119899 119886119903119903 =119888119890119897119897 119901119900119904 + 119897119888 119886119903119903[119889119894119903119890119888119905119894119900119899] (for the current cell) and119899119887 119889119894119903 119901119900119904 119894119899 119886119903119903 = 119899119887 119901119900119904 + 119899119887 119886119903119903[119889119894119903119890119888119905119894119900119899] (forthe neighbor cells)

4.1.4. Using the Pull Scheme to Reduce the Cost of Uncoalesced Accesses. Coalescing the global memory accesses can significantly reduce the memory overheads on the GPU. Multiple global memory loads whose addresses fall within a 128-byte range are combined into one request and sent to the memory. This saves a lot of memory bandwidth and improves the performance. In order to reduce the cost of the uncoalesced accesses, we use the pull scheme [12]. Choosing the pull scheme comes from the observation that the cost of uncoalesced reading is smaller than the cost of uncoalesced writing.

Algorithm 5 shows the LBM algorithm using the push scheme. At the first step, the pdfs are copied directly from the current cell. These pdfs are used to calculate the pdfs at the new time step (collision phase). The new pdfs are then streamed to the adjacent cells (streaming phase).


(1) __global__ void soa_push_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   Gather 19 pdfs from the current cell
(4)
(5)   Apply boundary conditions
(6)
(7)   Calculate the mass density ρ and the velocity u
(8)
(9)   Calculate the local equilibrium distribution functions f(eq) using ρ and u
(10)
(11)  Calculate the pdfs at the new time step
(12)
(13)  Stream 19 pdfs to the adjacent cells
(14) }

Algorithm 5: Kernel using the push scheme.

(1) __global__ void soa_pull_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   Stream 19 pdfs from adjacent cells to
(4)   the current cell
(5)
(6)   Apply boundary conditions
(7)
(8)   Calculate the mass density ρ and the velocity u
(9)
(10)  Calculate the local equilibrium distribution functions f(eq) using ρ and u
(11)
(12)  Calculate the pdf at the new time step
(13)
(14)  Save 19 values of pdf to the current cell
(15) }

Algorithm 6: Kernel using the pull scheme.

In the streaming phase, the distribution values are updated to the neighbors after they are calculated. All distribution values which do not move to the east or west direction (x-direction component equal to 0) can be updated to the neighbors (written to the device memory) directly, without any misalignment. However, the other distribution values (x-direction component equal to +1 or −1) need to be considered carefully because of their misaligned update positions. The update positions are shifted to memory locations that do not belong to the 128-byte segment, while the thread indexes are not shifted correspondingly. Thus misaligned accesses occur and the performance can degrade significantly.

If we use the pull scheme, on the other hand, the order of the collision phase and the streaming phase in the LBM kernel is reversed (see Algorithm 6). At the first step of the pull scheme, the pdfs from the adjacent cells are gathered to the current cell (streaming phase) (Lines 3–5). Next, these pdfs are used to calculate the pdfs at the new time step, and these new pdfs are then stored to the current cell directly (collision phase). Thus, in the pull scheme the uncoalesced accesses occur when the data is read from the device memory, whereas they occur when the data is written in the push scheme. As a result, the cost of the uncoalesced accesses is smaller with the pull scheme.

4.2. Tiling Optimization with Data Layout Change. In the D3Q19 model of the LBM, as the computations for the streaming and the collision phases are conducted for a certain cell, 19 distribution values which belong to the 19 surrounding cells are accessed. Figure 8 shows the data accesses to these 19 cells.


Figure 8: Data accesses for the orange cell in conducting the computations for the streaming and collision phases (the 19 accessed cells span the three planes Pi−1, Pi, and Pi+1 of the Nx × Ny grid).

When a thread performs the computations for the orange-colored cell, the 19 cells (18 directions (green cells) plus the current cell in the center (orange cell)) are distributed over three different planes. Let Pi be the plane containing the current computing (orange) cell, and let Pi−1 and Pi+1 be the lower and upper planes, respectively. The Pi plane contains 9 cells; the Pi−1 and Pi+1 planes contain 5 cells each. When the computations for the cell, for example, (x, y, z) = (1, 1, 1), are performed, the following cells are accessed:

(i) P0 plane: (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 2, 0), (2, 1, 0)
(ii) P1 plane: (0, 0, 1), (0, 1, 1), (0, 2, 1), (1, 0, 1), (1, 1, 1), (1, 2, 1), (2, 0, 1), (2, 1, 1), (2, 2, 1)
(iii) P2 plane: (0, 1, 2), (1, 0, 2), (1, 1, 2), (1, 2, 2), (2, 1, 2)

The 9 accesses for the P1 plane are divided into three groups: {(0, 0, 1), (0, 1, 1), (0, 2, 1)}, {(1, 0, 1), (1, 1, 1), (1, 2, 1)}, and {(2, 0, 1), (2, 1, 1), (2, 2, 1)}. Each group accesses consecutive memory locations belonging to the same row. Accesses of the different groups are separated apart and lead to uncoalesced accesses on the GPU when Nx is sufficiently large. In each of the P0 and P2 planes, there are three groups of accesses each. Here, too, the accesses of the same group touch consecutive memory locations, while accesses of the different groups are separated apart in the memory, which also leads to uncoalesced accesses. Accesses to the data elements in the different planes (P0, P1, and P2) are separated even further apart and likewise lead to uncoalesced accesses when Ny is sufficiently large.

As the computations proceed, three rows in the y-dimension of the P0, P1, P2 planes will be accessed sequentially: for x = 0 ∼ Nx − 1, y = 0, 1, 2; followed by x = 0 ∼ Nx − 1, y = 1, 2, 3; …; x = 0 ∼ Nx − 1, y = Ny − 3, Ny − 2, Ny − 1. When the complete P0, P1, P2 planes have been swept, similar data accesses continue for the P1, P2, and P3 planes, and so on. Therefore, there are a lot of data reuses in the x-, y-, and z-dimensions. As explained in Section 4.1.2, the 3D lattice grid

is stored in the 1D array. The 19 cells for the computations belonging to the same plane are stored ±1 or ±Nx ± 1 cells away. The cells in different planes are stored ±Nx × Ny ± Nx ± 1 cells away. The data reuse distance along the x-dimension is short: +1 or +2 loop iterations apart. The data reuse distance along the y- and z-dimensions is ±Nx ± 1 or ±Nx × Ny ± Nx ± 1 iterations apart. If we can make the data reuse occur sooner by reducing the reuse distances, for example, using the tiling optimization, we can greatly improve the cache hit ratio. Furthermore, this reduces the overheads of the uncoalesced accesses, because many global memory accesses can be removed by the cache hits. Therefore, we tile the 3D lattice grid into smaller 3D blocks. We also change the data layout in accordance with the data access patterns of the tiled code, in order to store the data elements of different groups closer in the memory. Thus we can remove a lot of uncoalesced memory accesses because the accessed elements can be stored within a 128-byte boundary. In Sections 4.2.1 and 4.2.2, we describe our tiling and data layout change optimizations.

4.2.1. Tiling. Let us assume the following:

(i) Nx, Ny, and Nz are the sizes of the grid in the x-, y-, and z-dimensions.

(ii) nx, ny, and nz are the sizes of the 3D block in the x-, y-, and z-dimensions.

(iii) An xy-plane is a subplane composed of (nx × ny) cells.

We tile the grid into small 3D blocks with the tile sizes nx, ny, and nz (yellow block in Figure 9(a)), where

nx = Nx ÷ xc,  ny = Ny ÷ yc,  nz = Nz ÷ zc,  with xc, yc, zc ∈ {1, 2, 3, …}. (8)

We let each CUDA thread block process one 3D tiled block. Thus nz xy-planes need to be loaded for each thread block. In each xy-plane, each thread of the thread block executes the computations for one grid cell. Thus each thread deals with a column containing nz cells (the red column in Figure 9(b)). If zc = 1, each thread processes Nz cells, and if zc = Nz, each thread processes only one cell. The tile size can be adjusted by changing the constants xc, yc, and zc; these constants need to be selected carefully to optimize the performance. Using the tiling, the number of created threads is reduced by zc times.
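A hypothetical launch configuration for such a tiled kernel might look as follows (the kernel name and variables are ours; the concrete factors correspond to the best tile reported in Section 5, nx = 32, ny = 16, nz = Nz ÷ 4):

const int nx = 32, ny = 16;            /* tile sizes in x and y                  */
const int zc = 4;                      /* each thread walks nz = Nz / zc cells   */
dim3 threads(nx, ny);                  /* one thread per cell of an xy-plane     */
dim3 blocks(Nx / nx, Ny / ny, zc);     /* one CUDA thread block per 3D tile      */
lbm_tiled_kernel<<<blocks, threads>>>(src_grid, dst_grid, types_arr);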

4.2.2. Data Layout Change. In order to further improve the benefits of the tiling and reduce the overheads associated with the uncoalesced accesses, we propose to change the data layout. Figure 10 shows one xy-plane of the grid with and without the layout change. With the original layout (Figure 10(a)), the data is stored in the row-major fashion: the entire first row is stored, followed by the second row, and so on.


Figure 9: Tiling optimization for LBM. (a) Grid divided into 3D blocks. (b) Block containing the subplanes; the red column contains all cells one thread processes.

Figure 10: Different data layouts for blocks. (a) Original data layout. (b) Proposed data layout.

In the proposed new layout, the cells in the tiled first row of Block 0 are stored first. Then the second tiled row of Block 0 is stored, instead of the first row of Block 1 (Figure 10(b)). With the layout change, the data cells accessed in consecutive iterations of the tiled code are placed sequentially. This places the data elements of the different groups closer together. Thus it increases the possibility that the memory accesses to the different groups are coalesced, if the tiling factor and the memory layout factor are adjusted appropriately. This can further improve the performance beyond the tiling alone.

The data layout can be transformed using the following formula:

index_new = x_id + y_id × Nx + z_id × Nx × Ny, (9)

where x_id and y_id are the cell indexes in the x- and y-dimensions on the plane of the grid, and z_id is a value in the range of 0 to nz − 1. x_id and y_id can be calculated as follows:

x_id = (block index in x-dimension) × (number of threads in a thread block in x-dimension) + (thread index in the thread block in x-dimension),

y_id = (block index in y-dimension) × (number of threads in a thread block in y-dimension) + (thread index in the thread block in y-dimension). (10)

In our implementation, we use the changed input data layout stored offline before the program starts (the original input is converted to the new layout and stored to the input file). This input file is then used while conducting the experiments.

4.3. Reduction of Register Uses per Thread. The D3Q19 model is more precise than the models with fewer distributions, such as D2Q9 or D3Q13, and thus uses more variables. This leads to more register uses in the main computation kernels. On the GPU, the register use of the threads is one of the factors limiting the number of active WARPs on a streaming multiprocessor (SMX). Higher register use can lead to lower parallelism and occupancy (see Figure 11 for an example), which results in overall performance degradation. The Nvidia compiler provides ways to cap the register uses, such as the -maxrregcount flag or the __launch_bounds__() qualifier [5]. The -maxrregcount switch sets a maximum on the number of registers used by each thread. These can help increase the occupancy by reducing the register uses per thread. However, our experiments show that the overall performance goes down with them, because they lead to a lot of register spills/refills to/from the local memory.


Figure 11: Sharing 2048 registers among (a) a larger number of threads with smaller register uses (eight blocks, 64 threads per block, and 4 registers per thread) versus (b) a smaller number of threads with larger register uses (eight blocks, 32 threads per block, and 8 registers per thread).

The increased memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.
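For reference, the two compiler-level mechanisms mentioned above are used as follows (shown only for completeness; as discussed, we do not rely on them):

# cap every kernel in the file at 40 registers per thread:
#   nvcc -maxrregcount=40 lbm.cu

__global__ void
__launch_bounds__(256, 4)   /* <= 256 threads/block, >= 4 resident blocks/SMX */
soa_pull_kernel(float *src, float *dst, unsigned char *flags) { /* ... */ }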

In order to reduce the register uses per thread while avoiding register spills/refills to/from the local memory, we used the following techniques:

(i) Calculate the indexing addresses of the distributions manually. Each cell has 19 distributions; thus we would need 38 variables for storing indexes (19 distributions × 2 memory accesses, for load and store) in the D3Q19 model. However, each index variable is used only one time in each execution phase. Thus we can use only two variables instead of 38: one for calculating the loading indexes and one for calculating the storing indexes (see the sketch after this list).

(ii) Use the shared memory for the variables commonly used among threads, for example, to store the base addresses.

(iii) Cast multiple small-size variables into one large variable; for example, we combined 4 char-type variables into one integer variable.

(iv) For simple operations which can be easily calculated, we do not store the results to memory variables. Instead, we recompute them later.

(v) We use only one array to store the distributions, instead of using 19 separate arrays.

(vi) In the original LBM code, a lot of variables are declared to store the FP computation results, which increases the register uses. In order to reduce the register uses, we attempt to reuse variables whose lifetimes ended earlier in the preceding code. This may lower the instruction-level parallelism of the kernel. However, it helps increase the thread-level parallelism, as more threads can be active at the same time with the reduced register uses per thread.

(vii) In the original LBM code, there are some complicated floating-point (FP) intensive computations used in a number of nearby statements. We aggressively extract these computations as common subexpressions. This frees the registers involved in the common subexpressions, thus reducing the register uses. It also reduces the dynamic instruction count.
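A condensed sketch of techniques (i) and (iii), with our hypothetical variable names (f, cell_pos, n_cells, and the char values c0..c3 are assumed, not from the paper's code):

/* (i) One reused loading index instead of 19 separate ones: */
int load_idx;
for (int d = 0; d < 19; ++d) {
    load_idx = cell_pos + lc_arr[d];         /* recomputed, then dead again */
    f[d] = src_grid[d * n_cells + load_idx];
}
/* (a second variable, store_idx, plays the same role on the store side) */

/* (iii) Four char-size values packed into one 32-bit variable: */
unsigned int packed = c0 | (c1 << 8) | (c2 << 16) | (c3 << 24);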

By applying the above techniques in our implementation, the number of registers used by each kernel is greatly reduced, from 70 registers to 40 registers. This leads to higher occupancy of the SMXs and significant performance improvements.

4.4. Removing Branch Divergence. Flow control instructions on the GPU cause the threads of the same WARP to diverge, and the resulting different execution paths get serialized. This can significantly affect the performance of the application program on the GPU; thus branch divergence should be avoided as much as possible. In the LBM code, there are two main sources of branch divergence:

(i) Solving the streaming at the boundary positions

(ii) Defining actions for corresponding cell types


Figure 12: Illustration of the lattice with a "ghost layer" (distinguishing the ghost layer, the boundary, and the other cells).

(1) IF (cell_type == FLUID)
(2)   x = a
(3) ELSE
(4)   x = b

Algorithm 7: Skeleton of an IF-statement used in the LBM kernel.

(1) is_fluid = (cell_type == FLUID)
(2) x = a * is_fluid + b * (!is_fluid)

Algorithm 8: Code with the IF-statement removed.

In order to avoid using IF-statements while streaming at the boundary positions, a "ghost layer" is attached in the y- and z-dimensions. If Nx, Ny, and Nz are the width, height, and depth of the original grid, then NNx = Nx, NNy = Ny + 1, and NNz = Nz + 1 are the new width, height, and depth of the grid with the ghost layer (Figure 12). With the ghost layer, we can treat the computations at the boundary positions as normal ones, without worrying about running out of the index bounds.
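A sketch of the corresponding allocation (variable names as in Section 4.1.3; error checking omitted):

const int NNx = Nx, NNy = Ny + 1, NNz = Nz + 1;   /* ghost layer in y and z */
size_t bytes = 19ull * NNx * NNy * NNz * sizeof(float);
float *src_grid, *dst_grid;
cudaMalloc((void **)&src_grid, bytes);
cudaMalloc((void **)&dst_grid, bytes);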

As explained in Section 4.1.2, the cells of the grid belong to three types: the regular fluid, the acceleration cells, or the boundary. The LBM kernel contains conditions to define the actions for each type of cell. The boundary cell type can be covered in the above-mentioned way using the ghost layer. This leaves the other two different conditions in the same half-WARP of the GPU. Thus, in order to remove the IF-conditions, we combine the conditions into computational statements. Using this technique, the IF-statement in Algorithm 7 is rewritten as in Algorithm 8.
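The same predication idea extends to the remaining cell-type conditions. Below is a sketch with our hypothetical names (ACCEL stands for the acceleration cell type and U_LID for the constant lid velocity of the LDC case):

int is_fluid = (cell_type == FLUID);
int is_accel = (cell_type == ACCEL);
/* both "branches" are evaluated; the flags select the live value
   arithmetically, so no WARP divergence occurs: */
ux = is_fluid * ux_computed + is_accel * U_LID;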

5. Experimental Results

In this section, we first describe the experimental setup. Then we show the performance results with analyses.

Figure 13: Performance (MLUPS) comparison of the Serial and the AoS implementations with different domain sizes (64³, 128³, 192³, 256³).

5.1. Experimental Setup. We implemented the LBM in the following five ways:

(i) Serial implementation using a single CPU core (Serial), using the source code from the SPEC CPU 2006 470.lbm [15] to make sure it is reasonably optimized.

(ii) Parallel implementation on a GPU using the AoS data scheme (AoS).

(iii) Parallel implementation using the SoA data scheme and the push scheme (SoA_Push_Only).

(iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA_Pull_Only).

(v) SoA using the pull scheme with our various optimizations, including the tiling with the data layout change (SoA_Pull_*).

We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. The domain grid sizes are scaled in the range of 64³, 128³, 192³, and 256³. The numbers of time steps are 1000, 5000, and 10000.

In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used, which is calculated as follows:

MLUPS = (Nx × Ny × Nz × Nts) / (10⁶ × T), (11)

where Nx, Ny, and Nz are the domain sizes in the x-, y-, and z-dimensions, Nts is the number of time steps used, and T is the run time of the simulation.

Our experiments were conducted on a system incorporating an Intel multicore processor (6-core 2.0 GHz Intel Xeon E5-2650) with 20 MB level-3 cache and an Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, the Nvidia GTX285.

5.2. Results Using Previous Approaches. The average performances of the Serial and the AoS implementations are shown in Figure 13.


Table 2: Summary of experiments.

Experiment           | Description
Serial               | Serial implementation on a single CPU core
Parallel (GPU):      |
AoS                  | AoS scheme
SoA_Push_Only        | SoA scheme + push data scheme
SoA_Pull_Only        | SoA scheme + pull data scheme
SoA_Pull_BR          | SoA scheme + pull data scheme + branch divergence removal
SoA_Pull_RR          | SoA scheme + pull data scheme + register reduction
SoA_Pull_Full        | SoA scheme + pull data scheme + branch divergence removal + register usage reduction
SoA_Pull_Full_Tiling | SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA (SoA_Push_Only) with different domain sizes.

With the domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the Serial implementation are 9.82 MLUPS, 7.42 MLUPS, 9.57 MLUPS, and 8.93 MLUPS. The MLUPS for the AoS are 112.06 MLUPS, 78.69 MLUPS, 76.86 MLUPS, and 74.99 MLUPS, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme: the SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³. Note that in this experiment we applied only the SoA scheme, without any other optimization techniques.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either implementation. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for the domain sizes 64³, 128³, 192³, and 256³, respectively. Thus the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively.

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

The number of global memory transactions observed shows that the total transactions (loads and stores) of the pull and push schemes are roughly equivalent. However, the number of store transactions of the pull scheme is 56.2% smaller than that of the push scheme. This leads to the performance improvement of the pull scheme compared with the push scheme.

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull implementation.

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences (explained in Section 4.4) in the kernel code. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses (described in Section 4.3) improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing the register uses for different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with both the branch divergence removal and the register usage reduction described in Sections 4.3 and 4.4 (SoA_Pull_Full) against the SoA_Pull_Only. The optimized SoA_Pull implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performance is better than the SoA_Pull_Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tile size for the best performance is nx = 32, ny = 16, and nz = Nz ÷ 4.

Figure 18: Performance (MLUPS) comparison of the SoA with and without our optimization techniques (SoA_Pull_Full versus SoA_Pull_Only) for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of the four implementations (Serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA_Pull_Full_Tiling with the domain size 256³, where a 136-time speedup over the Serial implementation is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were run on the same K20 GPU. Our approach performs better than [12] by 14% to 19%.


Table 3: Performance (MLUPS) comparisons of the four implementations.

Domain size | Time steps | Serial | AoS    | SoA_Pull_Only | SoA_Pull_Full_Tiling
64³         | 1000       | 9.89   | 111.73 | 759.52        | 1034
64³         | 5000       | 9.75   | 112.24 | 814.01        | 1115.63
64³         | 10000      | 9.82   | 111.73 | 818.36        | 1129.32
64³         | Avg. Perf. | 9.82   | 112.41 | 797.30        | 1092.99
128³        | 1000       | 7.64   | 78.58  | 798.21        | 1115.2
128³        | 5000       | 7.65   | 78.74  | 855.55        | 1189.15
128³        | 10000      | 6.98   | 78.74  | 861.42        | 1199.33
128³        | Avg. Perf. | 7.42   | 78.69  | 838.39        | 1167.69
192³        | 1000       | 9.47   | 76.79  | 811.35        | 1114.96
192³        | 5000       | 9.69   | 76.89  | 866.76        | 1185.95
192³        | 10000      | 9.56   | 76.91  | 871.39        | 1205.48
192³        | Avg. Perf. | 9.57   | 76.86  | 849.83        | 1168.8
256³        | 1000       | 8.99   | 74.74  | 787.76        | 1113.77
256³        | 5000       | 8.91   | 75.09  | 873.8         | 1182.33
256³        | 10000      | 8.9    | 75.14  | 883.56        | 1210.63
256³        | Avg. Perf. | 8.93   | 74.99  | 848.37        | 1168.91

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with the previous work [12].

Domain size | Mawson and Revell (2014) | Our work
64³         | 914                      | 1129
128³        | 990                      | 1199
192³        | 1036                     | 1205
256³        | 1020                     | 1210

Our approach incorporates more optimization techniques than [12], such as the tiling optimization with the data layout change and the branch divergence removal, among others.

Figure 21: Performance (MLUPS) on the GTX285 with different domain sizes (64³, 128³, 160³) for the implementations Serial, AoS, SoA_Push_Only, SoA_Pull_Only, SoA_Pull_RR, SoA_Pull_BR, SoA_Pull_Full, and SoA_Pull_Full_Tiling.

(viii) In order to further validate the effectiveness of our approach, we conducted more experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with the domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimized SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. We also obtained a 46-time speedup compared with the Serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX285 than on the K20.


Table 5: Performance (MLUPS) on the GTX285.

Domain size | Serial | AoS   | SoA_Push_Only | SoA_Pull_Only | SoA_Pull_BR | SoA_Pull_RR | SoA_Pull_Full | SoA_Pull_Full_Tiling
64³         | 9.82   | 45.22 | 240.97        | 257.36        | 268.45      | 283.24      | 296.73        | 328.87
128³        | 7.42   | 49.85 | 242.12        | 259.04        | 270.58      | 285.68      | 299.79        | 335.77
160³        | 9.57   | 50.15 | 237.58        | 254.18        | 266.25      | 279.59      | 294.26        | 328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory: Tölke [9] used this approach and implemented the D2Q9 model, Habich et al. [8] followed the same approach for the D3Q19 model, and Bailey et al. [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme is used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers, which lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, no extra synchronization cost is incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] a new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase; however, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There have also been approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using a base index, which forces the compiler to reuse the same register again.

A few different implementations of the LBM have been attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for both two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on a single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. To relieve the high register pressure of the LBM kernels and improve the thread parallelism available at run time, we developed techniques for aggressively reducing the register uses of the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU-friendly data organization scheme (the SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.0 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tölke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Stürmer, J. Götz, G. Richter, A. Dörfler, and U. Rüde, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tölke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl für Informatik, Würzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Vénere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.

Page 7: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

Scientific Programming 7

(1) global void soa push kernel(float

source grid float dest grid unsigned

char flags)

(2) (3) Gather 19 pdfs from the current cell

(4)(5) Apply boundary conditions

(6)(7) Calculate the mass density 120588 and the velocity u(8)(9) Calculate the local equilibrium distribution

functions 119891(eq) using 120588 and u(10)(11) Calculate the pdfs at new time step

(12)(13) Stream 19 pdfs to the adjacent cells

(14)

Algorithm 5 Kernel using the push scheme

(1) global void soa pull kernel(float

source grid float dest grid unsigned

char flags)

(2) (3) Stream 19 pdfs from adjacent cells to the

current cell

(4)(5) Apply boundary conditions

(6)(7) Calculate the mass density 120588 and the velocity u(8)(9) Calculate the local equilibrium distribution

functions 119891(eq) using 120588 and u(10)(11)(12) Calculate the pdf at new time step

(13)(14) Save 19 values of pdf to the current cell

(15)

Algorithm 6 Kernel using the pull scheme

the streaming phase the distribution values are updated toneighbors after they are calculated All distribution valueswhich do not move to the east or west direction (119909-directionvalues equal to 0) can be updated to the neighbors (writeto the device memory) directly without any misalignmentHowever other distribution values (119909-direction values equalto +1 or minus1) need to be considered carefully because oftheir misaligned update positions The update positions areshifted to the memory locations that do not belong tothe 128-byte segment while thread indexes are not shiftedcorrespondingly So the misaligned accesses occur and theperformance can degrade significantly

If we use the pull scheme on the other hand the orderof the collision phase and the streaming phase in the LBMkernel is reversed (see Algorithm 6) At the first step of the

pull scheme the pdfs from adjacent cells are gathered to thecurrent cell (streaming phase) (Lines 3ndash5) Next these pdfsare used to calculate the pdfs at the new time step and thesenew pdfs are then stored to the current cell directly (collisionphase) Thus in the pull scheme the uncoalesced accessesoccur when the data is read from the devicememory whereasthey occur when the data is written in the push scheme As aresult the cost of the uncoalesced accesses is smaller with thepull scheme

42 Tiling Optimization with Data Layout Change In theD3Q19 model of the LBM as computations for the streamingand the collision phases are conducted for a certain cell 19distribution values which belong to 19 surrounding cells areaccessed Figure 8 shows the data accesses to the 19 cells when

8 Scientific Programming

Pi+1

Pi

Piminus1

Ny

Nx

Figure 8 Data accesses for orange cell in conducting computationsfor streaming and collision phases

a thread performs the computations for the orange coloredcell The 19 cells (18 directions (green cells) + current cell inthe center (orange cell)) are distributed on the three differentplanes Let 119875119894 be the plane containing the current computing(orange) cell and let 119875119894minus1 and 119875119894+1 be the lower and upperplanes respectively 119875119894 plane contains 9 cells 119875119894minus1 and 119875119894+1planes contain 5 cells respectively When the computationsfor the cell for example (119909 119910 119911) = (1 1 1) are performedthe following cells are accessed

(i) 1198750 plane (0 1 0) (1 0 0) (1 1 0) (1 2 0) (2 1 0)(ii) 1198751 plane (0 0 1) (0 1 1) (0 2 1) (1 0 1) (1 1 1)(1 2 1) (2 0 1) (2 1 1) (2 2 1)(iii) 1198752 plane (0 1 2) (1 0 2) (1 1 2) (1 2 2) (2 1 2)

The 9 accesses for 1198751 plane are divided into three groups(001) (011) (021) (101) (111) (121) (201)(211) (221) Each group accesses the consecutivememorylocations belonging to the same row Accesses of the differentgroups are separated apart and lead to the uncoalescedaccesses on the GPU when 119873119909 is sufficiently large In eachof 1198750 and 1198752 planes there are three groups of accesses eachHere the accesses of the same group touch the consecutivememory locations and accesses of the different groups areseparated apart in thememory which lead to the uncoalescedaccesses also Accesses to the data elements in the differentplanes (1198750 1198751 and 1198752) are further separated apart and alsolead to the uncoalesced accesses when119873119910 is sufficiently large

As the computations proceed three rows in the 119910-dimension of 1198750 1198751 1198752 planes will be accessed sequentiallyfor 119909 = 0 sim 119873119909 minus 1 119910 = 0 1 2 followed by 119909 = 0 sim 119873119909 minus 1119910 = 1 2 3 119909 = 0 sim 119873119909 minus 1 119910 = 119873119910 minus 3119873119910 minus 2119873119910 minus 1When the complete 1198750 1198751 1198752 planes are swept then similardata accesses will continue for 1198751 1198752 and 1198753 planes and soon Therefore there are a lot of data reuses in 119909- 119910- and 119911-dimensions As explained in Section 412 the 3D lattice grid

is stored in the 1D array The 19 cells for the computationsbelonging to the same plane are stored plusmn1 or plusmn119873119909 + plusmn1 cellsaway The cells in different planes are stored plusmn119873119909 times 119873119910 +plusmn119873119909 + plusmn1 cells away The data reuse distance along the 119909-dimension is short +1 or +2 loop iterations apart The datareuse distance along the 119910- and 119911-dimensions is plusmn119873119909 + plusmn1 orplusmn119873119909 times 119873119910 + plusmn119873119909 + plusmn1 iterations apart If we can make thedata reuse occur faster by reducing the reuse distances forexample using the tiling optimization it can greatly improvethe cache hit ratio Furthermore it can reduce the overheadswith the uncoalesced accesses because lots of global memoryaccesses can be removed by the cache hits Therefore we tilethe 3D lattice grid into smaller 3D blocks We also changethe data layout in accordance with the data access patterns ofthe tiled code in order to store the data elements in differentgroups closer in the memory Thus we can remove a lot ofuncoalesced memory accesses because they can be storedwithin 128-byte boundary In Sections 421 and 422 wedescribe our tiling and data layout change optimizations

421 Tiling Let us assume the following

(i) 119873119909 119873119910 and 119873119911 are sizes of the grid in 119909- 119910- and 119911-dimension

(ii) 119899119909 119899119910 and 119899119911 are sizes of the 3D block in 119909- 119910- and119911-dimension

(iii) 119909119910-plane is a subplane which is composed of (119899119909times119899119910)cells

We tile the grid into small 3D blocks with the tile sizes of 119899119909119899119910 and 119899119911 (yellow block in Figure 9(a)) where

119899119909 = 119873119909 divide 119909119888119899119910 = 119873119910 divide 119910119888119899119911 = 119873119911 divide 119911119888

119909119888 119910119888 119911119888 = [1 2 3 )

(8)

We let each CUDA thread block process one 3D tiledblock Thus 119899119911 119909119910-planes need to be loaded for each threadblock In each 119909119910-plane each thread of the thread blockexecutes the computations for one grid cellThus each threaddeals with a column containing 119899119911 cells (the red column inFigure 9(b)) If 119911119888 = 1 each thread processes 119873119911 cells andif 119911119888 = 119873119911 each thread processes only one cell The tilesize can be adjusted by changing the constants 119909119888 119910119888 and 119911119888These constants need to be selected carefully to optimize theperformance Using the tiling the number of created threadsis reduced by 119911119888-times

422 Data Layout Change In order to further improvebenefits of the tiling and reduce the overheads associatedwith the uncoalesced accesses we propose to change thedata layout Figure 10 shows one 119909119910-plane of the grid withand without the layout change With the original layout(Figure 10(a)) the data is stored in the row major fashionThus the entire first row is stored followed by the second

Scientific Programming 9

x

y

zBlock

(a) Grid divided into 3D blocks (b) Block containing subplanes The red column contains all cells one thread processes

Figure 9 Tiling optimization for LBM

Block 3

Block 3

Block 0

Block 0

Block 1

Block 1

Block 2

Block 2

(a) Original data layout (b) Proposed data layout

Figure 10 Different data layouts for blocks

row and so on In the proposed new layout the cells in thetiled first row in Block 0 are stored first Then the secondtiled row of Block 0 is stored instead of the first row ofBlock 1 (Figure 10(b)) With the layout change the data cellsaccessed in the consecutive iterations of the tiled code areplaced sequentially This places the data elements of thedifferent groups closer Thus it increases the possibility forthese memory accesses to the different groups coalesced ifthe tiling factor and the memory layout factor are adjustedappropriately This can further improve the performancebeyond the tiling

The data layout can be transformed using the followingformula

indexnew = 119909id + 119910id times 119873119909 + 119911id times 119873119909 times 119873119910 (9)

where 119909id and 119910id are cell indexes in 119909- and 119910-dimension onthe plane of grid and 119911119894119889 is the value in the range of 0 to 119899119911minus1119909id and 119910id can be calculated as follows

119909id = (block index in 119909-dimension)times (number of threads in thread block in 119909-dimension)+ (thread index in thread block in 119909-dimension)

119910id = (block index in 119910-dimension)

times (number of threads in thread block in 119910-dimension)+ (thread index in thread block in 119910-dimension)

(10)

In our implementation we use the changed input datalayout stored offline before the program starts (The originalinput is changed to the new layout and stored to the input file)Then the input file is used while conducting the experiments

43 Reduction of Register Uses perThread TheD3Q19 modelis more precise than the models with smaller distributionssuch as D2Q9 orD3Q13 thus usingmore variablesThis leadsto more register uses for the main computation kernels InGPU the register use of the threads is one of the factorslimiting the number of active WARPs on a streaming mul-tiprocessor (SMX) Higher register uses can lead to the lowerparallelism and occupancy (see Figure 11 for an example)which results in the overall performance degradation TheNvidia compiler provides a flag to limit the register uses to acertain limit such as minus119898119886119909119903119903119890119892119888119900119906119899119905 or launch bounds ()qualifier [5] The minus119898119886119909119903119903119890119892119888119900119906119899119905 switch sets a maximumon the number of registers used for each thread Thesecan help increase the occupancy by reducing the registeruses per thread However our experiments show that theoverall performance goes down because they lead to a lot ofregister spillsrefills tofrom the local memoryThe increased

10 Scientific Programming

Block 0 Block 0 Thread 0

Thread 0

Thread 1Block 0 Thread 63

Thread 1Block 7 Block 7

Thread 63Block 7

Register file

middot middot middot

middot middot middot

(a) Eight blocks 64 threads per block and 4 registers per thread

Block 0 Thread 0

Block 0 Thread 31

Thread 0 Block 7

Thread 31Block 7

Register file

middot middot middot

middot middot middot

(b) Eight blocks 32 threads per block and 8 registers per thread

Figure 11 Sharing 2048 registers among (a) a larger number of threads with smaller register uses versus (b) a smaller number of threads withlarger register uses

memory traffic tofrom the local memory and the increasedinstruction count for accessing the local memory hurt theperformance

In order to reduce the register uses per thread whileavoiding the register spillrefill tofrom the local memory weused the following techniques

(i) Calculate indexing address of distributions manuallyEach cell has 19 distributions thus we need 38variables for storing indexes (19 distributions times 2memory accesses for load and store) in the D3Q19model However each index variable is used only onetime at each execution phase Thus we can use onlytwo variables instead of 38 one for calculating theloading indexes and one for calculating the storingindexes

(ii) Use the shared memory for the commonly used vari-ables among threads for example to store the baseaddresses

(iii) Casting multiple small size variables into one largevariable for example we combined 4 char type vari-ables into one integer variable

(iv) For simple operations which can be easily calculatedwe do not store them to the memory variablesInstead we recompute them later

(v) We use only one array to store the distributionsinstead of using 19 arrays separately

(vi) In the original LBM code a lot of variables aredeclared to store the FP computation results whichincrease the register uses In order to reduce the

register uses we attempt to reuse variables whoselife-time ended earlier in the former code This maylower the instruction-level parallelism of the kernelHowever it helps increase the thread-level parallelismas more threads can be active at the same time withthe reduced register uses per thread

(vii) In the original LBM code there are some complicatedfloating-point (FP) intensive computations used in anumber of nearby statementsWe aggressively extractthese computations as the common subexpressions Itfrees the registers involved in the common subexpres-sions thus reducing the register uses It also reducesthe number of dynamic instruction counts

Applying the above techniques in our implementationthe number of registers in each kernel is greatly reduced from70 registers to 40 registers It leads to the higher occupancy forthe SMXs and the significant performance improvements

44 Removing Branch Divergence Flow control instructionson the GPU cause the threads of the same WARP to divergeThus the resulting different execution paths get serializedThis can significantly affect the performance of the appli-cation program on the GPU Thus the branch divergenceshould be avoided as much as possible In the LBM codethere are two main problems which can cause the branchdivergence

(i) Solving the streaming at the boundary positions

(ii) Defining actions for corresponding cell types

Scientific Programming 11

Ghost layerBoundary

Others

Figure 12 Illustration of the lattice with a ldquoghost layerrdquo

(1) IF(cell type == FLUID)

(2) x = a

(3) ELSE

(4) x = b

Algorithm 7 Skeleton of IF-statement used in LBM kernel

(1) is fluid = (cell type == FLUID)

(2) x = a is fluid + b (is fluid)

Algorithm 8 Code with IF-statement removed

In order to avoid using 119868119865-119904119905119886119905119890119898119890119899119905119904 while streaming atthe boundary position a ldquoghost layerrdquo is attached in 119910- and119911-dimension If 119873119909 119873119910 and 119873119911 are the width height anddepth of the original grid 119873119873119909 = 119873119909 119873119873119910 = 119873119910 + 1 and119873119873119911 = 119873119911 + 1 are the new width height and depth of thegrid with the ghost layer (Figure 12) With the ghost layer wecan regard the computations at the boundary position as thenormal oneswithoutworrying about running out of the indexbound

As explained in Section 412 cells of the grid belong tothree types such as the regular fluid the acceleration cellsor the boundary The LBM kernel contains conditions todefine actions for each type of the cell The boundary celltype can be covered in the above-mentioned way using theghost layer This leads to the existence of the other twodifferent conditions in the same halfWARP of theGPUThusin order to remove 119868119865-119888119900119899119889119894119905119894119900119899119904 we combine conditionsinto computational statements Using this technique the 119868119865-119904119905119886119905119890119898119890119899119905 in Algorithm 7 is rewritten as in Algorithm 8

5 Experimental Results

In this section we first describe the experimental setupThenwe show the performance results with analyses

40

60

80

100

120

SerialAoS

0

20Perfo

rman

ce (M

LUPS

)

Domain size643 1283 1923 2563

Figure 13 Performance (MLUPS) comparison of the serial and theAoS with different domain sizes

51 Experimental Setup We implemented the LBM in thefollowing five ways

(i) Serial implementation using single CPU core (serial)using the source code from the SPEC CPU 2006470lbm [15] to make sure it is reasonably optimized

(ii) Parallel implementation on a GPU using the AoS datascheme (AoS)

(iii) Parallel implementation using the SoA data schemeand the push scheme (SoA Push Only)

(iv) Parallel implementation using the SoA data schemeand the pull scheme (SoA Pull Only)

(v) SoAusing pull schemewith our various optimizationsincluding the tiling with the data layout change(SoA Pull lowast)

We summarize our implementations in Table 2 We usedthe D3Q19 model for the LBM algorithm Domain grid sizesare scaled in the range of 643 1283 1923 and 2563 Thenumbers of time steps are 1000 5000 and 10000

In order to measure the performance of the LBM theMillion Lattice Updates Per Second (MLUPS) unit is usedwhich is calculated as follows

MLUPS = 119873119909 times 119873119910 times 119873119911 times 119873ts

106 times 119879 (11)

where119873119909119873119910 and119873119911 are domain sizes in the 119909- 119910- and 119911-dimension119873ts is the number of time steps used and 119879 is therun time of the simulation

Our experiments were conducted on a system incorpo-rating the Intel multicore processor (6-core 20Ghz IntelXeon E5-2650) with 20MB level-3 cache and Nvidia TeslaK20 GPU based on the Kepler architecture with 5GB devicememory The OS is CentOS 55 In order to validate theeffectiveness of our approach over the previous approacheswe have also conducted further experiments on anotherGPUNvidia GTX285 GPU

52 Results Using Previous Approaches The average perfor-mances of the serial and the AoS are shown in Figure 13

12 Scientific Programming

Table 2: Summary of experiments.

Serial: serial implementation on a single CPU core.
AoS: parallel implementation with the AoS scheme.
SoA_Push_Only: SoA scheme + push data scheme.
SoA_Pull_Only: SoA scheme + pull data scheme.
SoA_Pull_BR: SoA scheme + pull data scheme + branch divergence removal.
SoA_Pull_RR: SoA scheme + pull data scheme + register usage reduction.
SoA_Pull_Full: SoA scheme + pull data scheme + branch divergence removal + register usage reduction.
SoA_Pull_Full_Tiling: SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change.

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA_Push_Only with domain sizes 64³, 128³, 192³, and 256³.

With the domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the serial are 9.82 MLUPS, 7.42 MLUPS, 9.57 MLUPS, and 8.93 MLUPS. The MLUPS for the AoS are 112.06 MLUPS, 78.69 MLUPS, 76.86 MLUPS, and 74.99 MLUPS, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme. The SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³. Note that in this experiment we applied only the SoA scheme without any other optimization techniques.
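The gap follows directly from the two memory layouts. As a rough sketch (assuming single-precision distributions over a flattened grid of Ncells cells; the names are illustrative, not the paper's code):

/* AoS: the 19 distributions of one cell are contiguous, so consecutive
   threads reading the same distribution q touch addresses 19 floats
   apart, producing uncoalesced global memory accesses. */
typedef struct { float f[19]; } Cell;
Cell *grid_aos;                /* access: grid_aos[cell].f[q] */

/* SoA: distribution q of all cells is contiguous, so consecutive
   threads touch consecutive addresses, producing coalesced accesses. */
float *grid_soa;               /* access: grid_soa[q * Ncells + cell] */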

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either of the implementations. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for domain sizes 64³, 128³, 192³, and 256³, respectively. Thus, the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively.

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with domain sizes 64³, 128³, 192³, and 256³.

The number of global memory transactions observed shows that the total transactions (loads and stores) of the pull and push schemes are nearly equal. However, the number of store transactions of the pull scheme is 5.62% smaller than that of the push scheme. This leads to the performance improvement of the pull scheme over the push scheme.
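Conceptually, the two schemes differ in which side of the streaming step is misaligned. A simplified per-thread sketch for one distribution q (neighbor(), opposite(), and collide() are hypothetical helpers, not the paper's code):

/* Push: read the own cell, collide, then scatter to the neighbors.
   The neighbor stores are the misaligned (uncoalesced) accesses. */
f_dst[q * N + neighbor(cell, q)] = collide(f_src[q * N + cell], q);

/* Pull: gather from the neighbors, collide, then store to the own cell.
   The misalignment moves to the loads; the stores stay coalesced. */
f_dst[q * N + cell] = collide(f_src[q * N + neighbor(cell, opposite(q))], q);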

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull_Only implementation.

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences in the kernel code, as explained in Section 4.4. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses, as described in Section 4.3, improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without the branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing the register uses for different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with the optimization techniques described in Sections 4.3 and 4.4, that is, the branch divergence removal and the register usage reduction (SoA_Pull_Full), against the SoA_Pull_Only. The optimized SoA_Pull_Full implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performance is better than the SoA_Pull_Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tile size for the best performance is $n_x = 32$, $n_y = 16$, and $n_z = N_z \div 4$ (a launch configuration matching these sizes is sketched below).
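A hypothetical CUDA launch configuration consistent with this tile size ($z_c = 4$, so each thread walks one column of $N_z/4$ cells; the kernel name and arguments are illustrative, not the paper's code):

dim3 block(32, 16);               /* nx = 32, ny = 16: one thread per (x, y) cell of a tile */
dim3 grid(Nx / 32, Ny / 16, 4);   /* z-dimension split into 4 tiles of Nz/4 planes each */
lbm_kernel<<<grid, block>>>(f_src, f_dst, Nx, Ny, Nz);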

Figure 18: Performance (MLUPS) comparison of the SoA with and without the optimization techniques for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of the four implementations (Serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA_Pull_Full_Tiling with the domain size 256³, where a 136-time speedup over the serial implementation is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were conducted on the same K20 GPU. Our approach performs better than [12] by 14% to 19%, as our approach incorporates more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].


Table 3: Performance (MLUPS) comparisons of four implementations.

Domain size | Time steps | Serial | AoS | SoA_Pull_Only | SoA_Pull_Full_Tiling
64³ | 1000 | 9.89 | 111.73 | 759.52 | 1034.02
64³ | 5000 | 9.75 | 112.24 | 814.01 | 1115.63
64³ | 10000 | 9.82 | 111.73 | 818.36 | 1129.32
64³ | Avg. perf. | 9.82 | 112.41 | 797.30 | 1092.99
128³ | 1000 | 7.64 | 78.58 | 798.21 | 1115.25
128³ | 5000 | 7.65 | 78.74 | 855.55 | 1189.15
128³ | 10000 | 6.98 | 78.74 | 861.42 | 1199.33
128³ | Avg. perf. | 7.42 | 78.69 | 838.39 | 1167.69
192³ | 1000 | 9.47 | 76.79 | 811.35 | 1114.96
192³ | 5000 | 9.69 | 76.89 | 866.76 | 1185.95
192³ | 10000 | 9.56 | 76.91 | 871.39 | 1205.48
192³ | Avg. perf. | 9.57 | 76.86 | 849.83 | 1168.80
256³ | 1000 | 8.99 | 74.74 | 787.76 | 1113.77
256³ | 5000 | 8.91 | 75.09 | 873.80 | 1182.33
256³ | 10000 | 8.90 | 75.14 | 883.56 | 1210.63
256³ | Avg. perf. | 8.93 | 74.99 | 848.37 | 1168.91

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with previous work [12].

Domain size | Mawson and Revell (2014) | Our work
64³ | 914 | 1129
128³ | 990 | 1199
192³ | 1036 | 1205
256³ | 1020 | 1210


Figure 21: Performance (MLUPS) on the GTX285 with domain sizes 64³, 128³, and 160³ for all eight implementations (Serial, AoS, SoA_Push_Only, SoA_Pull_Only, SoA_Pull_BR, SoA_Pull_RR, SoA_Pull_Full, and SoA_Pull_Full_Tiling).

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimization technique SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. Also, we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX285 compared with the K20.


Table 5: Performance (MLUPS) on the GTX285.

Domain size | Serial | AoS | SoA_Push_Only | SoA_Pull_Only | SoA_Pull_BR | SoA_Pull_RR | SoA_Pull_Full | SoA_Pull_Full_Tiling
64³ | 9.82 | 45.22 | 240.97 | 257.36 | 268.45 | 283.24 | 296.73 | 328.87
128³ | 7.42 | 49.85 | 242.12 | 259.04 | 270.58 | 285.68 | 299.79 | 335.77
160³ | 9.57 | 50.15 | 237.58 | 254.18 | 266.25 | 279.59 | 294.26 | 328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus, most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory: Tolke in [9] used this approach and implemented the D2Q9 model, Habich et al. [8] followed the same approach for the D3Q19 model, and Bailey et al. in [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme is used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers, which lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] the new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on the single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure for the LBM kernels and improving the available thread parallelism generated at the run time, we developed techniques for aggressively reducing the register uses for the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.0 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Sturmer, J. Gotz, G. Richter, A. Dorfler, and U. Rude, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tolke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl fur Informatik, Wurzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Venere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.


Page 8: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

8 Scientific Programming

Pi+1

Pi

Piminus1

Ny

Nx

Figure 8 Data accesses for orange cell in conducting computationsfor streaming and collision phases

a thread performs the computations for the orange coloredcell The 19 cells (18 directions (green cells) + current cell inthe center (orange cell)) are distributed on the three differentplanes Let 119875119894 be the plane containing the current computing(orange) cell and let 119875119894minus1 and 119875119894+1 be the lower and upperplanes respectively 119875119894 plane contains 9 cells 119875119894minus1 and 119875119894+1planes contain 5 cells respectively When the computationsfor the cell for example (119909 119910 119911) = (1 1 1) are performedthe following cells are accessed

(i) 1198750 plane (0 1 0) (1 0 0) (1 1 0) (1 2 0) (2 1 0)(ii) 1198751 plane (0 0 1) (0 1 1) (0 2 1) (1 0 1) (1 1 1)(1 2 1) (2 0 1) (2 1 1) (2 2 1)(iii) 1198752 plane (0 1 2) (1 0 2) (1 1 2) (1 2 2) (2 1 2)

The 9 accesses for 1198751 plane are divided into three groups(001) (011) (021) (101) (111) (121) (201)(211) (221) Each group accesses the consecutivememorylocations belonging to the same row Accesses of the differentgroups are separated apart and lead to the uncoalescedaccesses on the GPU when 119873119909 is sufficiently large In eachof 1198750 and 1198752 planes there are three groups of accesses eachHere the accesses of the same group touch the consecutivememory locations and accesses of the different groups areseparated apart in thememory which lead to the uncoalescedaccesses also Accesses to the data elements in the differentplanes (1198750 1198751 and 1198752) are further separated apart and alsolead to the uncoalesced accesses when119873119910 is sufficiently large

As the computations proceed three rows in the 119910-dimension of 1198750 1198751 1198752 planes will be accessed sequentiallyfor 119909 = 0 sim 119873119909 minus 1 119910 = 0 1 2 followed by 119909 = 0 sim 119873119909 minus 1119910 = 1 2 3 119909 = 0 sim 119873119909 minus 1 119910 = 119873119910 minus 3119873119910 minus 2119873119910 minus 1When the complete 1198750 1198751 1198752 planes are swept then similardata accesses will continue for 1198751 1198752 and 1198753 planes and soon Therefore there are a lot of data reuses in 119909- 119910- and 119911-dimensions As explained in Section 412 the 3D lattice grid

is stored in the 1D array The 19 cells for the computationsbelonging to the same plane are stored plusmn1 or plusmn119873119909 + plusmn1 cellsaway The cells in different planes are stored plusmn119873119909 times 119873119910 +plusmn119873119909 + plusmn1 cells away The data reuse distance along the 119909-dimension is short +1 or +2 loop iterations apart The datareuse distance along the 119910- and 119911-dimensions is plusmn119873119909 + plusmn1 orplusmn119873119909 times 119873119910 + plusmn119873119909 + plusmn1 iterations apart If we can make thedata reuse occur faster by reducing the reuse distances forexample using the tiling optimization it can greatly improvethe cache hit ratio Furthermore it can reduce the overheadswith the uncoalesced accesses because lots of global memoryaccesses can be removed by the cache hits Therefore we tilethe 3D lattice grid into smaller 3D blocks We also changethe data layout in accordance with the data access patterns ofthe tiled code in order to store the data elements in differentgroups closer in the memory Thus we can remove a lot ofuncoalesced memory accesses because they can be storedwithin 128-byte boundary In Sections 421 and 422 wedescribe our tiling and data layout change optimizations

421 Tiling Let us assume the following

(i) 119873119909 119873119910 and 119873119911 are sizes of the grid in 119909- 119910- and 119911-dimension

(ii) 119899119909 119899119910 and 119899119911 are sizes of the 3D block in 119909- 119910- and119911-dimension

(iii) 119909119910-plane is a subplane which is composed of (119899119909times119899119910)cells

We tile the grid into small 3D blocks with the tile sizes of 119899119909119899119910 and 119899119911 (yellow block in Figure 9(a)) where

119899119909 = 119873119909 divide 119909119888119899119910 = 119873119910 divide 119910119888119899119911 = 119873119911 divide 119911119888

119909119888 119910119888 119911119888 = [1 2 3 )

(8)

We let each CUDA thread block process one 3D tiledblock Thus 119899119911 119909119910-planes need to be loaded for each threadblock In each 119909119910-plane each thread of the thread blockexecutes the computations for one grid cellThus each threaddeals with a column containing 119899119911 cells (the red column inFigure 9(b)) If 119911119888 = 1 each thread processes 119873119911 cells andif 119911119888 = 119873119911 each thread processes only one cell The tilesize can be adjusted by changing the constants 119909119888 119910119888 and 119911119888These constants need to be selected carefully to optimize theperformance Using the tiling the number of created threadsis reduced by 119911119888-times

422 Data Layout Change In order to further improvebenefits of the tiling and reduce the overheads associatedwith the uncoalesced accesses we propose to change thedata layout Figure 10 shows one 119909119910-plane of the grid withand without the layout change With the original layout(Figure 10(a)) the data is stored in the row major fashionThus the entire first row is stored followed by the second

Scientific Programming 9

x

y

zBlock

(a) Grid divided into 3D blocks (b) Block containing subplanes The red column contains all cells one thread processes

Figure 9 Tiling optimization for LBM

Block 3

Block 3

Block 0

Block 0

Block 1

Block 1

Block 2

Block 2

(a) Original data layout (b) Proposed data layout

Figure 10 Different data layouts for blocks

row and so on In the proposed new layout the cells in thetiled first row in Block 0 are stored first Then the secondtiled row of Block 0 is stored instead of the first row ofBlock 1 (Figure 10(b)) With the layout change the data cellsaccessed in the consecutive iterations of the tiled code areplaced sequentially This places the data elements of thedifferent groups closer Thus it increases the possibility forthese memory accesses to the different groups coalesced ifthe tiling factor and the memory layout factor are adjustedappropriately This can further improve the performancebeyond the tiling

The data layout can be transformed using the followingformula

indexnew = 119909id + 119910id times 119873119909 + 119911id times 119873119909 times 119873119910 (9)

where 119909id and 119910id are cell indexes in 119909- and 119910-dimension onthe plane of grid and 119911119894119889 is the value in the range of 0 to 119899119911minus1119909id and 119910id can be calculated as follows

119909id = (block index in 119909-dimension)times (number of threads in thread block in 119909-dimension)+ (thread index in thread block in 119909-dimension)

119910id = (block index in 119910-dimension)

times (number of threads in thread block in 119910-dimension)+ (thread index in thread block in 119910-dimension)

(10)

In our implementation we use the changed input datalayout stored offline before the program starts (The originalinput is changed to the new layout and stored to the input file)Then the input file is used while conducting the experiments

43 Reduction of Register Uses perThread TheD3Q19 modelis more precise than the models with smaller distributionssuch as D2Q9 orD3Q13 thus usingmore variablesThis leadsto more register uses for the main computation kernels InGPU the register use of the threads is one of the factorslimiting the number of active WARPs on a streaming mul-tiprocessor (SMX) Higher register uses can lead to the lowerparallelism and occupancy (see Figure 11 for an example)which results in the overall performance degradation TheNvidia compiler provides a flag to limit the register uses to acertain limit such as minus119898119886119909119903119903119890119892119888119900119906119899119905 or launch bounds ()qualifier [5] The minus119898119886119909119903119903119890119892119888119900119906119899119905 switch sets a maximumon the number of registers used for each thread Thesecan help increase the occupancy by reducing the registeruses per thread However our experiments show that theoverall performance goes down because they lead to a lot ofregister spillsrefills tofrom the local memoryThe increased

10 Scientific Programming

Block 0 Block 0 Thread 0

Thread 0

Thread 1Block 0 Thread 63

Thread 1Block 7 Block 7

Thread 63Block 7

Register file

middot middot middot

middot middot middot

(a) Eight blocks 64 threads per block and 4 registers per thread

Block 0 Thread 0

Block 0 Thread 31

Thread 0 Block 7

Thread 31Block 7

Register file

middot middot middot

middot middot middot

(b) Eight blocks 32 threads per block and 8 registers per thread

Figure 11 Sharing 2048 registers among (a) a larger number of threads with smaller register uses versus (b) a smaller number of threads withlarger register uses

memory traffic tofrom the local memory and the increasedinstruction count for accessing the local memory hurt theperformance

In order to reduce the register uses per thread whileavoiding the register spillrefill tofrom the local memory weused the following techniques

(i) Calculate indexing address of distributions manuallyEach cell has 19 distributions thus we need 38variables for storing indexes (19 distributions times 2memory accesses for load and store) in the D3Q19model However each index variable is used only onetime at each execution phase Thus we can use onlytwo variables instead of 38 one for calculating theloading indexes and one for calculating the storingindexes

(ii) Use the shared memory for the commonly used vari-ables among threads for example to store the baseaddresses

(iii) Casting multiple small size variables into one largevariable for example we combined 4 char type vari-ables into one integer variable

(iv) For simple operations which can be easily calculatedwe do not store them to the memory variablesInstead we recompute them later

(v) We use only one array to store the distributionsinstead of using 19 arrays separately

(vi) In the original LBM code a lot of variables aredeclared to store the FP computation results whichincrease the register uses In order to reduce the

register uses we attempt to reuse variables whoselife-time ended earlier in the former code This maylower the instruction-level parallelism of the kernelHowever it helps increase the thread-level parallelismas more threads can be active at the same time withthe reduced register uses per thread

(vii) In the original LBM code there are some complicatedfloating-point (FP) intensive computations used in anumber of nearby statementsWe aggressively extractthese computations as the common subexpressions Itfrees the registers involved in the common subexpres-sions thus reducing the register uses It also reducesthe number of dynamic instruction counts

Applying the above techniques in our implementationthe number of registers in each kernel is greatly reduced from70 registers to 40 registers It leads to the higher occupancy forthe SMXs and the significant performance improvements

44 Removing Branch Divergence Flow control instructionson the GPU cause the threads of the same WARP to divergeThus the resulting different execution paths get serializedThis can significantly affect the performance of the appli-cation program on the GPU Thus the branch divergenceshould be avoided as much as possible In the LBM codethere are two main problems which can cause the branchdivergence

(i) Solving the streaming at the boundary positions

(ii) Defining actions for corresponding cell types

Scientific Programming 11

Ghost layerBoundary

Others

Figure 12 Illustration of the lattice with a ldquoghost layerrdquo

(1) IF(cell type == FLUID)

(2) x = a

(3) ELSE

(4) x = b

Algorithm 7 Skeleton of IF-statement used in LBM kernel

(1) is fluid = (cell type == FLUID)

(2) x = a is fluid + b (is fluid)

Algorithm 8 Code with IF-statement removed

In order to avoid using 119868119865-119904119905119886119905119890119898119890119899119905119904 while streaming atthe boundary position a ldquoghost layerrdquo is attached in 119910- and119911-dimension If 119873119909 119873119910 and 119873119911 are the width height anddepth of the original grid 119873119873119909 = 119873119909 119873119873119910 = 119873119910 + 1 and119873119873119911 = 119873119911 + 1 are the new width height and depth of thegrid with the ghost layer (Figure 12) With the ghost layer wecan regard the computations at the boundary position as thenormal oneswithoutworrying about running out of the indexbound

As explained in Section 412 cells of the grid belong tothree types such as the regular fluid the acceleration cellsor the boundary The LBM kernel contains conditions todefine actions for each type of the cell The boundary celltype can be covered in the above-mentioned way using theghost layer This leads to the existence of the other twodifferent conditions in the same halfWARP of theGPUThusin order to remove 119868119865-119888119900119899119889119894119905119894119900119899119904 we combine conditionsinto computational statements Using this technique the 119868119865-119904119905119886119905119890119898119890119899119905 in Algorithm 7 is rewritten as in Algorithm 8

5 Experimental Results

In this section we first describe the experimental setupThenwe show the performance results with analyses

40

60

80

100

120

SerialAoS

0

20Perfo

rman

ce (M

LUPS

)

Domain size643 1283 1923 2563

Figure 13 Performance (MLUPS) comparison of the serial and theAoS with different domain sizes

51 Experimental Setup We implemented the LBM in thefollowing five ways

(i) Serial implementation using single CPU core (serial)using the source code from the SPEC CPU 2006470lbm [15] to make sure it is reasonably optimized

(ii) Parallel implementation on a GPU using the AoS datascheme (AoS)

(iii) Parallel implementation using the SoA data schemeand the push scheme (SoA Push Only)

(iv) Parallel implementation using the SoA data schemeand the pull scheme (SoA Pull Only)

(v) SoAusing pull schemewith our various optimizationsincluding the tiling with the data layout change(SoA Pull lowast)

We summarize our implementations in Table 2 We usedthe D3Q19 model for the LBM algorithm Domain grid sizesare scaled in the range of 643 1283 1923 and 2563 Thenumbers of time steps are 1000 5000 and 10000

In order to measure the performance of the LBM theMillion Lattice Updates Per Second (MLUPS) unit is usedwhich is calculated as follows

MLUPS = 119873119909 times 119873119910 times 119873119911 times 119873ts

106 times 119879 (11)

where119873119909119873119910 and119873119911 are domain sizes in the 119909- 119910- and 119911-dimension119873ts is the number of time steps used and 119879 is therun time of the simulation

Our experiments were conducted on a system incorpo-rating the Intel multicore processor (6-core 20Ghz IntelXeon E5-2650) with 20MB level-3 cache and Nvidia TeslaK20 GPU based on the Kepler architecture with 5GB devicememory The OS is CentOS 55 In order to validate theeffectiveness of our approach over the previous approacheswe have also conducted further experiments on anotherGPUNvidia GTX285 GPU

52 Results Using Previous Approaches The average perfor-mances of the serial and the AoS are shown in Figure 13

12 Scientific Programming

Table 2 Summary of experiments

Experiment DescriptionSerial Serial implementation on single CPU coreParallelAoS AoS schemeSoA Push Only SoA scheme + push data schemeSoA Pull

SoA Pull Only SoA scheme + pull data schemeSoA Pull BR SoA scheme + pull data scheme + branch divergence removalSoA Pull RR SoA scheme + pull data scheme + register reductionSoA Pull Full SoA scheme + pull data scheme + branch divergence removal + register usage reduction

SoA Pull Full Tiling SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with datalayout change

Perfo

rman

ce (M

LUPS

)

400500600700800900

0100200300

SoA_Push_OnlyAoS

Domain size643 1283 1923 2563

Figure 14 Performance (MLUPS) comparison of the AoS and theSoA with different domain sizes

With various domain sizes of 643 1283 1923 and 2563 theMLUPS numbers for the serial are 982MLUPS 742MLUPS957 MLUPS and 893 MLUPS The MLUPS for the AoSare 11206 MLUPS 7869 MLUPS 7686 MLUPS and 7499MLUPS respectively With these numbers as the baseline wealsomeasured the SoAperformance for various domain sizesFigure 14 shows that the SoA significantly outperforms theAoS scheme The SoA is faster than the AoS by 663 9911028 and 1049 times for the domain sizes 643 1283 1923and 2563 Note that in this experiment we applied only theSoA scheme without any other optimization techniques

Figure 15 compares the performance of the pull schemeand the push scheme For fair comparison we did notapply any other optimization techniques to both of theimplementationsThe pull scheme performs at 7973MLUPS8384 MLUPS 8498 MLUPS and 84837 MLUPS whereasthe push scheme performs at 7434 MLUPS 780 MLUPS79016 MLUPS and 78713 MLUPS for domain sizes 6431283 1923 and 2563 respectively Thus the pull scheme isbetter than the push scheme by 675 697 702 and72 respectivelyThenumber of globalmemory transactions

760780800820840860

Perfo

rman

ce (M

LUPS

)

680700720740

SoA_Pull_OnlySoA_Push_Only

Domain size643 1283 1923 2563

Figure 15 Performance (MLUPS) comparison of the SoA usingpush scheme and pull scheme with different domain sizes

observed shows that the total transactions (loads and stores)of the pull and push schemes are quite equivalent Howeverthe number of store transactions of the pull scheme is 562smaller than the push scheme This leads to the performanceimprovement of the pull scheme compared with the pushscheme

53 Results Using Our Optimization Techniques In this sub-section we show the performance improvements of our opti-mization techniques compared with the previous approachbased on the SoA Pull implementation

(i) Figure 16 compares the average performance of theSoA with and without removing the branch diver-gences explained in Section 44 in the kernel codeRemoving the branch divergence improves the per-formance by 437 445 469 and 519 fordomain sizes 643 1283 1923 and 2563 respectively

(ii) Reducing the register uses described in Section 43improves the performance by 1207 1244 1198and 1258 as Figure 17 shows

Scientific Programming 13Pe

rform

ance

(MLU

PS)

820840860880900920

740760780800

SoA_Pull_OnlySoA_Pull_BR

Domain size643 1283 1923 2563

Figure 16 Performance (MLUPS) comparison of the SoA with andwithout branch removal for different domain sizes

750

800

850

900

950

1000

Perfo

rman

ce (M

LUPS

)

600

650

700

750

SoA_Pull_OnlySoA_Pull_RR

Domain size643 1283 1923 2563

Figure 17 Performance (MLUPS) comparison of the SoA with andwithout reducing register uses for different domain sizes

(iii) Figure 18 compares the performance of the SoA usingthe pull scheme with optimization techniques suchas the branch divergence removal and the registerusage reduction described in Sections 43 and 44(SoA Pull Full) and the SoA Pull Only The opti-mized SoA Pull implementation is better than theSoA Pull Only by 1644 1689 1668 and 1777for the domain sizes 643 1283 1923 and 2563respectively

(iv) Figure 19 shows the performance comparison ofthe SoA Pull Full and SoA Pull Full Tiling TheSoA Pull Full Tiling performance is better than theSoA Pull Full from 1178 to 136 The domainsize 1283 gives the best performance improvement of136 while the domain size 2563 gives the lowestimprovement of 1178 The experimental resultsshow that the tiling size for the best performance is119899119909 = 32 119899119910 = 16 and 119899119911 = 119873119911 divide 4

600

800

1000

1200

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_OnlySoA_Pull_Full

Domain size643 1283 1923 2563

Figure 18 Performance (MLUPS) comparison of the SoA with andwithout optimization techniques for different domain sizes

600

800

1000

1200

1400

0

200

400

Perfo

rman

ce (M

LUP)

SoA_Pull_Full_TilingSoA_Pull_Full

Domain size643 1283 1923 2563

Figure 19 Performance (MLUPS) comparison of the SoA Pull Fulland the SoA Pull Full Tiling with different domain sizes

(v) Figure 20 presents the overall performance of theSoA Pull Full Tiling implementation compared withthe SoA Pull Only With all our optimization tech-niques described in Sections 42 43 and 44 weobtained 28 overall performance improvementscompared with the previous approach

(vi) Table 3 compares the performance of four im-plementations (serial AoS SoA Pull Only andSoA Pull Full Tiling) with different domain sizesAs shown the peak performance of 121063 MLUPSis achieved by the SoA Pull Full Tiling with domainsize 2563 where the speedup of 136 is also achieved

(vii) Table 4 compares the performance of our work withthe previous work conducted by Mawson and Revell[12] Both implementations were conducted on thesame K20 GPU Our approach performs better than[12] from 14 to 19 Our approach incorporates

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210

more optimization techniques such as the tiling opti-mization with the data layout change the branchdivergence removal among others compared with[12]

SerialAoSSoA_Push_OnlySoA_Pull_Only

SoA_Pull_RRSoA_Pull_BRSoA_Pull_FullSoA_Pull_Full_Tiling

Domain size643 1283 1603

Perfo

rman

ce (M

LUPS

)

400

350

300

250

200

150

100

50

0

Figure 21 Performance (MLUPS) on GTX285 with differentdomain sizes

(viii) In order to validate the effectiveness of our approachwe conducted more experiments on the other GPUNvidia GTX285 Table 5 and Figure 21 show theaverage performance of our implementations withdomain sizes 643 1283 and 1603 (The grid sizeslarger than 1603 cannot be accommodated in thedevice memory of the GTX 285) As shown ouroptimization technique SoA Pull Full Tiling is bet-ter than the previous SoA Pull Only up to 2285Alsowe obtained 46-time speedup comparedwith theserial implementation The level of the performanceimprovement and the speedup are however lower onthe GTX 285 compared with the K20

Scientific Programming 15

Table 5 Performance (MLUPS) on GTX285

Domain sizes Serial AoS SoA Push Only SoA Pull Only SoA Pull BR SoA Pull RR SoA Pull Full SoA Pull Full Tiling643 982 4522 24097 25736 26845 28324 29673 328871283 742 4985 24212 25904 27058 28568 29979 335771603 957 5015 23758 25418 26625 27959 29426 32862

6 Previous Research

Previous parallelization approaches for the LBM algorithmfocused on two main issues how to efficiently organize thedata and how to avoid the misalignment in the streaming(propagation) phase of the LBM In the data organizationthe AoS and the SoA are two most commonly used schemesWhile AoS scheme is suitable for the CPU architecture SoAis a better scheme for the GPU architecture when the globalmemory access coalition technique is incorporated Thusmost implementations of the LBM on the GPU use the SoAas the main data organization

In order to avoid the misalignment in the streamingphase of the LBM there are two main approaches The firstproposed approach uses the shared memory Tolke in [9]used the approach and implemented the D2Q9 modelHabich et al [8] followed the same approach for the D3Q19model Bailey et al in [16] also used the shared memory toachieve 100 coalescence in the propagation phase for theD3Q13 model In the second approach the pull scheme wasused instead of the push scheme without using the sharedmemory

As observed the main aim of using the sharedmemory isto avoid the misaligned accesses caused by the distributionvalues moving to the east and west directions [8 9 17]However the shared memory implementation needs extrasynchronizations and intermediate registers This lowers theachieved bandwidth In addition using the shared memorylimits the maximum number of threads per thread blockbecause of the limited size of the shared memory [17] whichreduces the number of activeWARPs (occupancy) of the ker-nels thereby hurting the performance Using the pull schemeinstead there is no extra synchronization cost incurredand no intermediate registers are needed In addition thebetter utilization of the registers in the pull scheme leads togenerating a larger number of threads as the total numberof registers is fixed This leads to better utilization of theGPUrsquos multithreading capability and higher performanceLatest results in [12] confirm the higher performance ofthe pull scheme compared with using the shared mem-ory

Besides the above approaches in [12] the new featureof the Tesla K20 GPU shuffle instruction was applied toavoid the misalignment in the streaming phase However theobtained results were worse In [18] Obrecht et al focused onchoosing careful data transfer schemes in the global memoryinstead of using the shared memory in order to solve themisaligned memory access problem

There were some approaches to maximize the GPUmultiprocessor occupancy by reducing the register uses perthread Bailey et al in [16] showed 20 improvement in

maximum performance compared with the D3Q19 model in[8]They set the number of registers used by the kernel belowa certain limit using the Nvidia compiler flag However thisapproach may spill the register data to the local memoryHabich et al [8] suggested a method to reduce the number ofregisters by using the base index which forces the compilerto reuse the same register again

A few different implementations of the LBM were at-tempted Astorino et al [19] built a GPU implementationframework for the LBM valid for the two- and three-dimensional problemsThe framework is organized in amod-ular fashion and allows for easy modification They used theSoA scheme and the semidirect approach as the addressingschemeThey also adopted the swapping technique to save thememory required for the LBM implementation Rinaldi et al[17] suggested an approach based on the single-step algorithmwith a reversed collision-propagation scheme They used theshared memory as the main computational memory insteadof the global memory In our implementation we adoptedthese approaches for the SoA Pull Only implementationshown in Section 5

7 Conclusion

In this paper we developed high performance paralleliza-tion of the LBM algorithm with the D3Q19 model on theGPU In order to improve the cache locality and minimizethe overheads associated with the uncoalesced accesses inmoving the data to the adjacent cells in the streamingphase of the LBM we used the tiling optimization withthe data layout change For reducing the high registerpressure for the LBM kernels and improving the availablethread parallelism generated at the run time we developedtechniques for aggressively reducing the register uses forthe kernels We also developed optimization techniquesfor removing the branch divergence Other already-knowntechniques were also adopted in our parallel implementationsuch as combining the streaming phase and the collisionphase into one phase to reduce the memory overhead aGPU friendly data organization scheme so-called the SoAscheme efficient data placement of the major data structuresin the GPU memory hierarchy and adopting a data updatescheme (pull scheme) to further reduce the overheads ofthe uncoalesced accesses Experimental results on the 6-core22 Ghz Intel Xeon processor and the Nvidia Tesla K20 GPUusing CUDA show that our approach leads to impressive per-formance results It delivers up to 121063MLUPS throughputperformance and achieves up to 136-time speedup com-pared with a serial implementation running on single CPUcore

16 Scientific Programming

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research was supported by Next-Generation Infor-mation Computing Development Program through theNational Research Foundation of Korea (NRF) funded bythe Ministry of Education Science and Technology (NRF-2015M3C4A7065662) This work was supported by theHuman Resources Program in Energy Technology of theKorea Institute of Energy Technology Evaluation and Plan-ning (KETEP) granted financial resource from the Min-istry of Trade Industry amp Energy Republic of Korea (no20154030200770)

References

[1] J-P Rivet and J P Boon Lattice Gas Hydrodynamics vol 11Cambridge University Press Cambridge UK 2005

[2] J Tolke and M Krafczyk ldquoTeraFLOP computing on a desktopPC with GPUs for 3D CFDrdquo International Journal of Computa-tional Fluid Dynamics vol 22 no 7 pp 443ndash456 2008

[3] G Crimi FMantovaniM Pivanti S F Schifano and R Tripic-cione ldquoEarly experience on porting and running a Lattice Boltz-mann code on the Xeon-Phi co-processorrdquo in Proceedings of the13th Annual International Conference on Computational Science(ICCS rsquo13) vol 18 pp 551ndash560 Barcelona Spain June 2013

[4] M Sturmer J Gotz G Richter A Dorfler and U RudeldquoFluid flow simulation on the Cell Broadband Engine usingthe lattice Boltzmann methodrdquo Computers ampMathematics withApplications vol 58 no 5 pp 1062ndash1070 2009

[5] NVIDA CUDA Toolkit Documentation September 2015httpdocsnvidiacomcudaindexhtml

[6] GROUP KHRONOS OpenCL 2015 httpswwwkhronosorgopencl

[7] OpenACC-standardorg OpenACC March 2012 httpwwwopenaccorg

[8] J Habich T Zeiser G Hager and G Wellein ldquoPerformanceanalysis and optimization strategies for a D3Q19 lattice Boltz-mann kernel on nVIDIA GPUs using CUDArdquo Advances inEngineering Software vol 42 no 5 pp 266ndash272 2011

[9] J Tolke ldquoImplementation of a Lattice Boltzmann kernel usingthe compute unified device architecture developed by nVIDIArdquoComputing and Visualization in Science vol 13 no 1 pp 29ndash392010

[10] G Wellein T Zeiser G Hager and S Donath ldquoOn the singleprocessor performance of simple lattice Boltzmann kernelsrdquoComputers and Fluids vol 35 no 8-9 pp 910ndash919 2006

[11] MWittmann T Zeiser GHager andGWellein ldquoComparisonof different propagation steps for lattice Boltzmann methodsrdquoComputers andMathematics with Applications vol 65 no 6 pp924ndash935 2013

[12] M J Mawson and A J Revell ldquoMemory transfer optimizationfor a lattice Boltzmann solver on Kepler architecture nVidiaGPUsrdquo Computer Physics Communications vol 185 no 10 pp2566ndash2574 2014

[13] N Tran M Lee and D H Choi ldquoMemory-efficient par-allelization of 3D lattice boltzmann flow solver on a GPUrdquoin Proceedings of the IEEE 22nd International Conference onHigh Performance Computing (HiPC rsquo15) pp 315ndash324 IEEEBangalore India December 2015

[14] K Iglberger Cache Optimizations for the Lattice BoltzmannMethod in 3D vol 10 Lehrstuhl fur Informatik WurzburgGermany 2003

[15] J L Henning ldquoSPECCPU2006 Benchmark descriptionsrdquoACMSIGARCH Computer Architecture News vol 34 no 4 pp 1ndash172006

[16] P Bailey J Myre S D C Walsh D J Lilja and M O SaarldquoAccelerating lattice boltzmann fluid flow simulations usinggraphics processorsrdquo in Proceedings of the 38th InternationalConference on Parallel Processing (ICPP rsquo09) pp 550ndash557 IEEEVienna Austria September 2009

[17] P R Rinaldi E A DariM J Venere andA Clausse ldquoA Lattice-Boltzmann solver for 3D fluid simulation on GPUrdquo SimulationModelling Practice andTheory vol 25 pp 163ndash171 2012

[18] C Obrecht F Kuznik B Tourancheau and J-J Roux ldquoAnew approach to the lattice Boltzmann method for graphicsprocessing unitsrdquo Computers amp Mathematics with Applicationsvol 61 no 12 pp 3628ndash3638 2011

[19] M Astorino J B Sagredo and A Quarteroni ldquoA modularlattice boltzmann solver for GPU computing processorsrdquo SeMAJournal vol 59 no 1 pp 53ndash78 2012

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

Scientific Programming 9

x

y

zBlock

(a) Grid divided into 3D blocks (b) Block containing subplanes The red column contains all cells one thread processes

Figure 9 Tiling optimization for LBM

Block 3

Block 3

Block 0

Block 0

Block 1

Block 1

Block 2

Block 2

(a) Original data layout (b) Proposed data layout

Figure 10 Different data layouts for blocks

row and so on In the proposed new layout the cells in thetiled first row in Block 0 are stored first Then the secondtiled row of Block 0 is stored instead of the first row ofBlock 1 (Figure 10(b)) With the layout change the data cellsaccessed in the consecutive iterations of the tiled code areplaced sequentially This places the data elements of thedifferent groups closer Thus it increases the possibility forthese memory accesses to the different groups coalesced ifthe tiling factor and the memory layout factor are adjustedappropriately This can further improve the performancebeyond the tiling

The data layout can be transformed using the followingformula

indexnew = 119909id + 119910id times 119873119909 + 119911id times 119873119909 times 119873119910 (9)

where 119909id and 119910id are cell indexes in 119909- and 119910-dimension onthe plane of grid and 119911119894119889 is the value in the range of 0 to 119899119911minus1119909id and 119910id can be calculated as follows

119909id = (block index in 119909-dimension)times (number of threads in thread block in 119909-dimension)+ (thread index in thread block in 119909-dimension)

119910id = (block index in 119910-dimension)

times (number of threads in thread block in 119910-dimension)+ (thread index in thread block in 119910-dimension)

(10)

In our implementation we use the changed input datalayout stored offline before the program starts (The originalinput is changed to the new layout and stored to the input file)Then the input file is used while conducting the experiments

43 Reduction of Register Uses perThread TheD3Q19 modelis more precise than the models with smaller distributionssuch as D2Q9 orD3Q13 thus usingmore variablesThis leadsto more register uses for the main computation kernels InGPU the register use of the threads is one of the factorslimiting the number of active WARPs on a streaming mul-tiprocessor (SMX) Higher register uses can lead to the lowerparallelism and occupancy (see Figure 11 for an example)which results in the overall performance degradation TheNvidia compiler provides a flag to limit the register uses to acertain limit such as minus119898119886119909119903119903119890119892119888119900119906119899119905 or launch bounds ()qualifier [5] The minus119898119886119909119903119903119890119892119888119900119906119899119905 switch sets a maximumon the number of registers used for each thread Thesecan help increase the occupancy by reducing the registeruses per thread However our experiments show that theoverall performance goes down because they lead to a lot ofregister spillsrefills tofrom the local memoryThe increased


Figure 11: Sharing 2048 registers among (a) a larger number of threads with smaller register use (eight blocks, 64 threads per block, and 4 registers per thread) versus (b) a smaller number of threads with larger register use (eight blocks, 32 threads per block, and 8 registers per thread).

The increased memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.
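Both register-capping mechanisms mentioned above are standard CUDA features; the specific numbers in this sketch are illustrative, not the values used in the paper.

```cuda
// Cap the register budget per thread via launch bounds: at most 128
// threads per block, with a hint to fit 8 resident blocks per SMX.
__global__ void __launch_bounds__(128, 8)
collide_kernel(float *f)
{
    /* ... kernel body ... */
}

// Alternatively, cap registers for the whole file at compile time:
//   nvcc -maxrregcount=40 lbm.cu
```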

In order to reduce the register uses per thread while avoiding the register spill/refill to/from the local memory, we used the following techniques (a sketch illustrating techniques (i) and (iii) follows the list):

(i) Calculate the indexing addresses of the distributions manually. Each cell has 19 distributions; thus we would need 38 index variables (19 distributions × 2 memory accesses, one load and one store) in the D3Q19 model. However, each index variable is used only once in each execution phase. Thus we can use only two variables instead of 38: one for calculating the loading indexes and one for calculating the storing indexes.

(ii) Use the shared memory for the variables commonly used among threads, for example, to store the base addresses.

(iii) Cast multiple small-size variables into one large variable; for example, we combined 4 char-type variables into one integer variable.

(iv) For simple operations which can be easily calculated, we do not store the results to memory variables. Instead, we recompute them later.

(v) We use only one array to store the distributions instead of using 19 separate arrays.

(vi) In the original LBM code, a lot of variables are declared to store the FP computation results, which increases the register uses. In order to reduce the register uses, we attempt to reuse variables whose lifetime ended earlier in the preceding code. This may lower the instruction-level parallelism of the kernel. However, it helps increase the thread-level parallelism, as more threads can be active at the same time with the reduced register uses per thread.

(vii) In the original LBM code, there are some complicated floating-point- (FP-) intensive computations used in a number of nearby statements. We aggressively extract these computations as common subexpressions. This frees the registers involved in the common subexpressions, thus reducing the register uses. It also reduces the dynamic instruction count.
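The sketch below illustrates techniques (i) and (iii); the array names, the offset table, and pack4() are assumptions of this sketch, not the authors' exact kernel code.

```cuda
#include <cuda_runtime.h>

#define NDIR 19  /* D3Q19 distributions */

__constant__ int c_offset[NDIR];  // per-direction linear offsets (assumed)

__device__ __forceinline__ unsigned pack4(unsigned char a, unsigned char b,
                                          unsigned char c, unsigned char d)
{
    // Technique (iii): four 1-byte flags share one 32-bit register.
    return (unsigned)a | ((unsigned)b << 8) |
           ((unsigned)c << 16) | ((unsigned)d << 24);
}

__global__ void lbm_load_sketch(const float *f_in, float *f_out,
                                int base_stride)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * base_stride;

    float f_local[NDIR];
    int idx;  // Technique (i): one index register reused for all 19 loads
    for (int d = 0; d < NDIR; ++d) {
        idx = base + c_offset[d];   // recomputed each time, never kept live
        f_local[d] = f_in[idx];
    }
    // ... collision would go here; the 19 stores would likewise reuse a
    // single storing-index variable.
    for (int d = 0; d < NDIR; ++d)
        f_out[base + d] = f_local[d];
}
```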

Applying the above techniques in our implementation, the number of registers used in each kernel is greatly reduced, from 70 registers to 40 registers. This leads to higher occupancy of the SMXs and significant performance improvements.

4.4. Removing Branch Divergence. Flow control instructions on the GPU cause the threads of the same WARP to diverge, and the resulting different execution paths get serialized. This can significantly affect the performance of the application program on the GPU, so branch divergence should be avoided as much as possible. In the LBM code, there are two main problems which can cause branch divergence:

(i) Solving the streaming at the boundary positions

(ii) Defining actions for corresponding cell types


Figure 12: Illustration of the lattice with a "ghost layer"; the legend distinguishes the ghost layer, the boundary cells, and the others.

IF (cell_type == FLUID)
    x = a
ELSE
    x = b

Algorithm 7: Skeleton of the IF-statement used in the LBM kernel.

is_fluid = (cell_type == FLUID)
x = a * is_fluid + b * (!is_fluid)

Algorithm 8: Code with the IF-statement removed.

In order to avoid using IF-statements while streaming at the boundary positions, a "ghost layer" is attached in the y- and z-dimension. If Nx, Ny, and Nz are the width, height, and depth of the original grid, then NNx = Nx, NNy = Ny + 1, and NNz = Nz + 1 are the new width, height, and depth of the grid with the ghost layer (Figure 12). With the ghost layer, we can treat the computations at the boundary positions as normal ones without worrying about running out of the index bounds.
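A minimal host-side sketch of the enlarged allocation, assuming a single distribution array and the function/array names are ours, not the paper's:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define NDIR 19  /* D3Q19 */

// Allocate the grid with the ghost layer described in the text:
// NNx = Nx, NNy = Ny + 1, NNz = Nz + 1.
float *alloc_grid_with_ghost(size_t Nx, size_t Ny, size_t Nz)
{
    size_t NNx = Nx, NNy = Ny + 1, NNz = Nz + 1;
    float *d_f = NULL;
    cudaMalloc((void **)&d_f, NNx * NNy * NNz * NDIR * sizeof(float));
    return d_f;  // one array holds all 19 distributions (technique (v))
}
```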

As explained in Section 4.1.2, the cells of the grid belong to three types: the regular fluid cells, the acceleration cells, and the boundary cells. The LBM kernel contains conditions to define the actions for each type of cell. The boundary cell type can be covered in the above-mentioned way using the ghost layer. This leaves the other two different conditions in the same half-WARP of the GPU. Thus, in order to remove the IF-conditions, we fold the conditions into the computational statements. Using this technique, the IF-statement in Algorithm 7 is rewritten as in Algorithm 8.
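In CUDA terms, the rewrite of Algorithm 8 might look as follows; cell_type, FLUID, a, and b are assumed names in this sketch.

```cuda
#define FLUID 0  /* assumed encoding of the fluid cell type */

__device__ __forceinline__ float select_fluid(int cell_type, float a, float b)
{
    // is_fluid is 1 or 0; both candidate values are blended arithmetically,
    // so every thread in the half-WARP executes the same instructions
    // instead of taking divergent branch paths.
    int is_fluid = (cell_type == FLUID);
    return a * (float)is_fluid + b * (float)(1 - is_fluid);
}
```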

5. Experimental Results

In this section, we first describe the experimental setup. Then we show the performance results with analyses.

Figure 13: Performance (MLUPS) comparison of the serial and the AoS implementations with different domain sizes (64³, 128³, 192³, 256³).

5.1. Experimental Setup. We implemented the LBM in the following five ways:

(i) Serial implementation using a single CPU core (serial), using the source code from the SPEC CPU 2006 470.lbm [15] to make sure it is reasonably optimized.

(ii) Parallel implementation on a GPU using the AoS data scheme (AoS).

(iii) Parallel implementation using the SoA data scheme and the push scheme (SoA_Push_Only).

(iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA_Pull_Only).

(v) SoA using the pull scheme with our various optimizations, including the tiling with the data layout change (SoA_Pull_*).

We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. Domain grid sizes are scaled in the range of 64³, 128³, 192³, and 256³. The numbers of time steps are 1000, 5000, and 10000.

In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used, which is calculated as follows:

$$\text{MLUPS} = \frac{N_x \times N_y \times N_z \times N_{\text{ts}}}{10^6 \times T}, \tag{11}$$

where $N_x$, $N_y$, and $N_z$ are the domain sizes in the $x$-, $y$-, and $z$-dimension, $N_{\text{ts}}$ is the number of time steps used, and $T$ is the run time of the simulation.
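As a quick worked example (the run time here is back-computed from the peak throughput reported in Section 5.3, so it is illustrative rather than an additional measurement): a $256^3$ grid advanced for $10{,}000$ time steps in about $138.6$ s gives

$$\text{MLUPS} = \frac{256^3 \times 10^4}{10^6 \times 138.6} \approx 1210.6.$$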

Our experiments were conducted on a system incorporating an Intel multicore processor (6-core 2.0 GHz Intel Xeon E5-2650) with 20 MB level-3 cache and an Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, the Nvidia GTX285.

5.2. Results Using Previous Approaches. The average performances of the serial and the AoS implementations are shown in Figure 13.


Table 2: Summary of experiments.

Experiment            Description
Serial                Serial implementation on a single CPU core
AoS                   AoS scheme
SoA_Push_Only         SoA scheme + push data scheme
SoA_Pull_Only         SoA scheme + pull data scheme
SoA_Pull_BR           SoA scheme + pull data scheme + branch divergence removal
SoA_Pull_RR           SoA scheme + pull data scheme + register reduction
SoA_Pull_Full         SoA scheme + pull data scheme + branch divergence removal + register usage reduction
SoA_Pull_Full_Tiling  SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA (SoA_Push_Only) with different domain sizes.

With the domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the serial implementation are 9.82, 7.42, 9.57, and 8.93, respectively; the MLUPS for the AoS are 112.06, 78.69, 76.86, and 74.99, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme: the SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³. Note that in this experiment we applied only the SoA scheme, without any other optimization techniques.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either implementation. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for the domain sizes 64³, 128³, 192³, and 256³, respectively. Thus the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively. The number of global memory transactions

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

observed shows that the total transactions (loads and stores) of the pull and push schemes are nearly equal. However, the number of store transactions of the pull scheme is 5.62% smaller than that of the push scheme. This leads to the performance improvement of the pull scheme over the push scheme.

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull implementation.

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences (explained in Section 4.4) in the kernel code. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses (described in Section 4.3) improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing register uses for different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with the optimization techniques of branch divergence removal and register usage reduction described in Sections 4.3 and 4.4 (SoA_Pull_Full) against the SoA_Pull_Only. The optimized implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performance is better than the SoA_Pull_Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tiling size for the best performance is nx = 32, ny = 16, and nz = Nz/4.

Figure 18: Performance (MLUPS) comparison of the SoA with and without the optimization techniques for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of four implementations (serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA_Pull_Full_Tiling with the domain size 256³, where the speedup of 136 over the serial implementation is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were run on the same K20 GPU. Our approach performs better than [12] by 14% to 19%. Our approach incorporates


Table 3: Performance (MLUPS) comparisons of four implementations.

Domain size  Time steps  Serial  AoS     SoA_Pull_Only  SoA_Pull_Full_Tiling
64³          1000        9.89    111.73  759.52         1034.00
64³          5000        9.75    112.24  814.01         1115.63
64³          10000       9.82    111.73  818.36         1129.32
64³          Avg.        9.82    112.41  797.30         1092.99
128³         1000        7.64    78.58   798.21         1115.20
128³         5000        7.65    78.74   855.55         1189.15
128³         10000       6.98    78.74   861.42         1199.33
128³         Avg.        7.42    78.69   838.39         1167.69
192³         1000        9.47    76.79   811.35         1114.96
192³         5000        9.69    76.89   866.76         1185.95
192³         10000       9.56    76.91   871.39         1205.48
192³         Avg.        9.57    76.86   849.83         1168.80
256³         1000        8.99    74.74   787.76         1113.77
256³         5000        8.91    75.09   873.80         1182.33
256³         10000       8.89    75.14   883.56         1210.63
256³         Avg.        8.93    74.99   848.37         1168.91

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with the previous work [12].

Domain size  Mawson and Revell (2014)  Our work
64³          914                       1129
128³         990                       1199
192³         1036                      1205
256³         1020                      1210

more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].

Figure 21: Performance (MLUPS) on the GTX285 with different domain sizes (64³, 128³, 160³) for the serial, AoS, SoA_Push_Only, SoA_Pull_Only, SoA_Pull_RR, SoA_Pull_BR, SoA_Pull_Full, and SoA_Pull_Full_Tiling implementations.

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with the domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimized SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. Also, we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX285 than on the K20.


Table 5: Performance (MLUPS) on the GTX285.

Domain size  Serial  AoS    SoA_Push_Only  SoA_Pull_Only  SoA_Pull_BR  SoA_Pull_RR  SoA_Pull_Full  SoA_Pull_Full_Tiling
64³          9.82    45.22  240.97         257.36         268.45       283.24       296.73         328.87
128³         7.42    49.85  242.12         259.04         270.58       285.68       299.79         335.77
160³         9.57    50.15  237.58         254.18         266.25       279.59       294.26         328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory: Tolke [9] used this approach and implemented the D2Q9 model; Habich et al. [8] followed the same approach for the D3Q19 model; and Bailey et al. [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme is used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving in the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers, which lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, no extra synchronization cost is incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme allows a larger number of threads to be generated, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] a new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase; however, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using a base index, which forces the compiler to reuse the same register again.

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on a single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure of the LBM kernels and improving the available thread parallelism generated at run time, we developed techniques for aggressively reducing the register uses of the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU-friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.0 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Sturmer, J. Gotz, G. Richter, A. Dorfler, and U. Rude, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tolke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl fur Informatik, Wurzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Venere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

10 Scientific Programming

Block 0 Block 0 Thread 0

Thread 0

Thread 1Block 0 Thread 63

Thread 1Block 7 Block 7

Thread 63Block 7

Register file

middot middot middot

middot middot middot

(a) Eight blocks 64 threads per block and 4 registers per thread

Block 0 Thread 0

Block 0 Thread 31

Thread 0 Block 7

Thread 31Block 7

Register file

middot middot middot

middot middot middot

(b) Eight blocks 32 threads per block and 8 registers per thread

Figure 11 Sharing 2048 registers among (a) a larger number of threads with smaller register uses versus (b) a smaller number of threads withlarger register uses

memory traffic tofrom the local memory and the increasedinstruction count for accessing the local memory hurt theperformance

In order to reduce the register uses per thread whileavoiding the register spillrefill tofrom the local memory weused the following techniques

(i) Calculate indexing address of distributions manuallyEach cell has 19 distributions thus we need 38variables for storing indexes (19 distributions times 2memory accesses for load and store) in the D3Q19model However each index variable is used only onetime at each execution phase Thus we can use onlytwo variables instead of 38 one for calculating theloading indexes and one for calculating the storingindexes

(ii) Use the shared memory for the commonly used vari-ables among threads for example to store the baseaddresses

(iii) Casting multiple small size variables into one largevariable for example we combined 4 char type vari-ables into one integer variable

(iv) For simple operations which can be easily calculatedwe do not store them to the memory variablesInstead we recompute them later

(v) We use only one array to store the distributionsinstead of using 19 arrays separately

(vi) In the original LBM code a lot of variables aredeclared to store the FP computation results whichincrease the register uses In order to reduce the

register uses we attempt to reuse variables whoselife-time ended earlier in the former code This maylower the instruction-level parallelism of the kernelHowever it helps increase the thread-level parallelismas more threads can be active at the same time withthe reduced register uses per thread

(vii) In the original LBM code there are some complicatedfloating-point (FP) intensive computations used in anumber of nearby statementsWe aggressively extractthese computations as the common subexpressions Itfrees the registers involved in the common subexpres-sions thus reducing the register uses It also reducesthe number of dynamic instruction counts

Applying the above techniques in our implementationthe number of registers in each kernel is greatly reduced from70 registers to 40 registers It leads to the higher occupancy forthe SMXs and the significant performance improvements

44 Removing Branch Divergence Flow control instructionson the GPU cause the threads of the same WARP to divergeThus the resulting different execution paths get serializedThis can significantly affect the performance of the appli-cation program on the GPU Thus the branch divergenceshould be avoided as much as possible In the LBM codethere are two main problems which can cause the branchdivergence

(i) Solving the streaming at the boundary positions

(ii) Defining actions for corresponding cell types

Scientific Programming 11

Ghost layerBoundary

Others

Figure 12 Illustration of the lattice with a ldquoghost layerrdquo

(1) IF(cell type == FLUID)

(2) x = a

(3) ELSE

(4) x = b

Algorithm 7 Skeleton of IF-statement used in LBM kernel

(1) is fluid = (cell type == FLUID)

(2) x = a is fluid + b (is fluid)

Algorithm 8 Code with IF-statement removed

In order to avoid using 119868119865-119904119905119886119905119890119898119890119899119905119904 while streaming atthe boundary position a ldquoghost layerrdquo is attached in 119910- and119911-dimension If 119873119909 119873119910 and 119873119911 are the width height anddepth of the original grid 119873119873119909 = 119873119909 119873119873119910 = 119873119910 + 1 and119873119873119911 = 119873119911 + 1 are the new width height and depth of thegrid with the ghost layer (Figure 12) With the ghost layer wecan regard the computations at the boundary position as thenormal oneswithoutworrying about running out of the indexbound

As explained in Section 412 cells of the grid belong tothree types such as the regular fluid the acceleration cellsor the boundary The LBM kernel contains conditions todefine actions for each type of the cell The boundary celltype can be covered in the above-mentioned way using theghost layer This leads to the existence of the other twodifferent conditions in the same halfWARP of theGPUThusin order to remove 119868119865-119888119900119899119889119894119905119894119900119899119904 we combine conditionsinto computational statements Using this technique the 119868119865-119904119905119886119905119890119898119890119899119905 in Algorithm 7 is rewritten as in Algorithm 8

5 Experimental Results

In this section we first describe the experimental setupThenwe show the performance results with analyses

40

60

80

100

120

SerialAoS

0

20Perfo

rman

ce (M

LUPS

)

Domain size643 1283 1923 2563

Figure 13 Performance (MLUPS) comparison of the serial and theAoS with different domain sizes

51 Experimental Setup We implemented the LBM in thefollowing five ways

(i) Serial implementation using single CPU core (serial)using the source code from the SPEC CPU 2006470lbm [15] to make sure it is reasonably optimized

(ii) Parallel implementation on a GPU using the AoS datascheme (AoS)

(iii) Parallel implementation using the SoA data schemeand the push scheme (SoA Push Only)

(iv) Parallel implementation using the SoA data schemeand the pull scheme (SoA Pull Only)

(v) SoAusing pull schemewith our various optimizationsincluding the tiling with the data layout change(SoA Pull lowast)

We summarize our implementations in Table 2 We usedthe D3Q19 model for the LBM algorithm Domain grid sizesare scaled in the range of 643 1283 1923 and 2563 Thenumbers of time steps are 1000 5000 and 10000

In order to measure the performance of the LBM theMillion Lattice Updates Per Second (MLUPS) unit is usedwhich is calculated as follows

MLUPS = 119873119909 times 119873119910 times 119873119911 times 119873ts

106 times 119879 (11)

where119873119909119873119910 and119873119911 are domain sizes in the 119909- 119910- and 119911-dimension119873ts is the number of time steps used and 119879 is therun time of the simulation

Our experiments were conducted on a system incorpo-rating the Intel multicore processor (6-core 20Ghz IntelXeon E5-2650) with 20MB level-3 cache and Nvidia TeslaK20 GPU based on the Kepler architecture with 5GB devicememory The OS is CentOS 55 In order to validate theeffectiveness of our approach over the previous approacheswe have also conducted further experiments on anotherGPUNvidia GTX285 GPU

52 Results Using Previous Approaches The average perfor-mances of the serial and the AoS are shown in Figure 13

12 Scientific Programming

Table 2 Summary of experiments

Experiment DescriptionSerial Serial implementation on single CPU coreParallelAoS AoS schemeSoA Push Only SoA scheme + push data schemeSoA Pull

SoA Pull Only SoA scheme + pull data schemeSoA Pull BR SoA scheme + pull data scheme + branch divergence removalSoA Pull RR SoA scheme + pull data scheme + register reductionSoA Pull Full SoA scheme + pull data scheme + branch divergence removal + register usage reduction

SoA Pull Full Tiling SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with datalayout change

Perfo

rman

ce (M

LUPS

)

400500600700800900

0100200300

SoA_Push_OnlyAoS

Domain size643 1283 1923 2563

Figure 14 Performance (MLUPS) comparison of the AoS and theSoA with different domain sizes

With various domain sizes of 643 1283 1923 and 2563 theMLUPS numbers for the serial are 982MLUPS 742MLUPS957 MLUPS and 893 MLUPS The MLUPS for the AoSare 11206 MLUPS 7869 MLUPS 7686 MLUPS and 7499MLUPS respectively With these numbers as the baseline wealsomeasured the SoAperformance for various domain sizesFigure 14 shows that the SoA significantly outperforms theAoS scheme The SoA is faster than the AoS by 663 9911028 and 1049 times for the domain sizes 643 1283 1923and 2563 Note that in this experiment we applied only theSoA scheme without any other optimization techniques

Figure 15 compares the performance of the pull schemeand the push scheme For fair comparison we did notapply any other optimization techniques to both of theimplementationsThe pull scheme performs at 7973MLUPS8384 MLUPS 8498 MLUPS and 84837 MLUPS whereasthe push scheme performs at 7434 MLUPS 780 MLUPS79016 MLUPS and 78713 MLUPS for domain sizes 6431283 1923 and 2563 respectively Thus the pull scheme isbetter than the push scheme by 675 697 702 and72 respectivelyThenumber of globalmemory transactions

760780800820840860

Perfo

rman

ce (M

LUPS

)

680700720740

SoA_Pull_OnlySoA_Push_Only

Domain size643 1283 1923 2563

Figure 15 Performance (MLUPS) comparison of the SoA usingpush scheme and pull scheme with different domain sizes

observed shows that the total transactions (loads and stores)of the pull and push schemes are quite equivalent Howeverthe number of store transactions of the pull scheme is 562smaller than the push scheme This leads to the performanceimprovement of the pull scheme compared with the pushscheme

53 Results Using Our Optimization Techniques In this sub-section we show the performance improvements of our opti-mization techniques compared with the previous approachbased on the SoA Pull implementation

(i) Figure 16 compares the average performance of theSoA with and without removing the branch diver-gences explained in Section 44 in the kernel codeRemoving the branch divergence improves the per-formance by 437 445 469 and 519 fordomain sizes 643 1283 1923 and 2563 respectively

(ii) Reducing the register uses described in Section 43improves the performance by 1207 1244 1198and 1258 as Figure 17 shows

Scientific Programming 13Pe

rform

ance

(MLU

PS)

820840860880900920

740760780800

SoA_Pull_OnlySoA_Pull_BR

Domain size643 1283 1923 2563

Figure 16 Performance (MLUPS) comparison of the SoA with andwithout branch removal for different domain sizes

750

800

850

900

950

1000

Perfo

rman

ce (M

LUPS

)

600

650

700

750

SoA_Pull_OnlySoA_Pull_RR

Domain size643 1283 1923 2563

Figure 17 Performance (MLUPS) comparison of the SoA with andwithout reducing register uses for different domain sizes

(iii) Figure 18 compares the performance of the SoA usingthe pull scheme with optimization techniques suchas the branch divergence removal and the registerusage reduction described in Sections 43 and 44(SoA Pull Full) and the SoA Pull Only The opti-mized SoA Pull implementation is better than theSoA Pull Only by 1644 1689 1668 and 1777for the domain sizes 643 1283 1923 and 2563respectively

(iv) Figure 19 shows the performance comparison ofthe SoA Pull Full and SoA Pull Full Tiling TheSoA Pull Full Tiling performance is better than theSoA Pull Full from 1178 to 136 The domainsize 1283 gives the best performance improvement of136 while the domain size 2563 gives the lowestimprovement of 1178 The experimental resultsshow that the tiling size for the best performance is119899119909 = 32 119899119910 = 16 and 119899119911 = 119873119911 divide 4

600

800

1000

1200

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_OnlySoA_Pull_Full

Domain size643 1283 1923 2563

Figure 18 Performance (MLUPS) comparison of the SoA with andwithout optimization techniques for different domain sizes

600

800

1000

1200

1400

0

200

400

Perfo

rman

ce (M

LUP)

SoA_Pull_Full_TilingSoA_Pull_Full

Domain size643 1283 1923 2563

Figure 19 Performance (MLUPS) comparison of the SoA Pull Fulland the SoA Pull Full Tiling with different domain sizes

(v) Figure 20 presents the overall performance of theSoA Pull Full Tiling implementation compared withthe SoA Pull Only With all our optimization tech-niques described in Sections 42 43 and 44 weobtained 28 overall performance improvementscompared with the previous approach

(vi) Table 3 compares the performance of four im-plementations (serial AoS SoA Pull Only andSoA Pull Full Tiling) with different domain sizesAs shown the peak performance of 121063 MLUPSis achieved by the SoA Pull Full Tiling with domainsize 2563 where the speedup of 136 is also achieved

(vii) Table 4 compares the performance of our work withthe previous work conducted by Mawson and Revell[12] Both implementations were conducted on thesame K20 GPU Our approach performs better than[12] from 14 to 19 Our approach incorporates

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210

more optimization techniques such as the tiling opti-mization with the data layout change the branchdivergence removal among others compared with[12]

SerialAoSSoA_Push_OnlySoA_Pull_Only

SoA_Pull_RRSoA_Pull_BRSoA_Pull_FullSoA_Pull_Full_Tiling

Domain size643 1283 1603

Perfo

rman

ce (M

LUPS

)

400

350

300

250

200

150

100

50

0

Figure 21 Performance (MLUPS) on GTX285 with differentdomain sizes

(viii) In order to validate the effectiveness of our approachwe conducted more experiments on the other GPUNvidia GTX285 Table 5 and Figure 21 show theaverage performance of our implementations withdomain sizes 643 1283 and 1603 (The grid sizeslarger than 1603 cannot be accommodated in thedevice memory of the GTX 285) As shown ouroptimization technique SoA Pull Full Tiling is bet-ter than the previous SoA Pull Only up to 2285Alsowe obtained 46-time speedup comparedwith theserial implementation The level of the performanceimprovement and the speedup are however lower onthe GTX 285 compared with the K20

Scientific Programming 15

Table 5 Performance (MLUPS) on GTX285

Domain sizes Serial AoS SoA Push Only SoA Pull Only SoA Pull BR SoA Pull RR SoA Pull Full SoA Pull Full Tiling643 982 4522 24097 25736 26845 28324 29673 328871283 742 4985 24212 25904 27058 28568 29979 335771603 957 5015 23758 25418 26625 27959 29426 32862

6 Previous Research

Previous parallelization approaches for the LBM algorithmfocused on two main issues how to efficiently organize thedata and how to avoid the misalignment in the streaming(propagation) phase of the LBM In the data organizationthe AoS and the SoA are two most commonly used schemesWhile AoS scheme is suitable for the CPU architecture SoAis a better scheme for the GPU architecture when the globalmemory access coalition technique is incorporated Thusmost implementations of the LBM on the GPU use the SoAas the main data organization

In order to avoid the misalignment in the streamingphase of the LBM there are two main approaches The firstproposed approach uses the shared memory Tolke in [9]used the approach and implemented the D2Q9 modelHabich et al [8] followed the same approach for the D3Q19model Bailey et al in [16] also used the shared memory toachieve 100 coalescence in the propagation phase for theD3Q13 model In the second approach the pull scheme wasused instead of the push scheme without using the sharedmemory

As observed the main aim of using the sharedmemory isto avoid the misaligned accesses caused by the distributionvalues moving to the east and west directions [8 9 17]However the shared memory implementation needs extrasynchronizations and intermediate registers This lowers theachieved bandwidth In addition using the shared memorylimits the maximum number of threads per thread blockbecause of the limited size of the shared memory [17] whichreduces the number of activeWARPs (occupancy) of the ker-nels thereby hurting the performance Using the pull schemeinstead there is no extra synchronization cost incurredand no intermediate registers are needed In addition thebetter utilization of the registers in the pull scheme leads togenerating a larger number of threads as the total numberof registers is fixed This leads to better utilization of theGPUrsquos multithreading capability and higher performanceLatest results in [12] confirm the higher performance ofthe pull scheme compared with using the shared mem-ory

Besides the above approaches in [12] the new featureof the Tesla K20 GPU shuffle instruction was applied toavoid the misalignment in the streaming phase However theobtained results were worse In [18] Obrecht et al focused onchoosing careful data transfer schemes in the global memoryinstead of using the shared memory in order to solve themisaligned memory access problem

There were some approaches to maximize the GPUmultiprocessor occupancy by reducing the register uses perthread Bailey et al in [16] showed 20 improvement in

maximum performance compared with the D3Q19 model in[8]They set the number of registers used by the kernel belowa certain limit using the Nvidia compiler flag However thisapproach may spill the register data to the local memoryHabich et al [8] suggested a method to reduce the number ofregisters by using the base index which forces the compilerto reuse the same register again

A few different implementations of the LBM were at-tempted Astorino et al [19] built a GPU implementationframework for the LBM valid for the two- and three-dimensional problemsThe framework is organized in amod-ular fashion and allows for easy modification They used theSoA scheme and the semidirect approach as the addressingschemeThey also adopted the swapping technique to save thememory required for the LBM implementation Rinaldi et al[17] suggested an approach based on the single-step algorithmwith a reversed collision-propagation scheme They used theshared memory as the main computational memory insteadof the global memory In our implementation we adoptedthese approaches for the SoA Pull Only implementationshown in Section 5

7 Conclusion

In this paper we developed high performance paralleliza-tion of the LBM algorithm with the D3Q19 model on theGPU In order to improve the cache locality and minimizethe overheads associated with the uncoalesced accesses inmoving the data to the adjacent cells in the streamingphase of the LBM we used the tiling optimization withthe data layout change For reducing the high registerpressure for the LBM kernels and improving the availablethread parallelism generated at the run time we developedtechniques for aggressively reducing the register uses forthe kernels We also developed optimization techniquesfor removing the branch divergence Other already-knowntechniques were also adopted in our parallel implementationsuch as combining the streaming phase and the collisionphase into one phase to reduce the memory overhead aGPU friendly data organization scheme so-called the SoAscheme efficient data placement of the major data structuresin the GPU memory hierarchy and adopting a data updatescheme (pull scheme) to further reduce the overheads ofthe uncoalesced accesses Experimental results on the 6-core22 Ghz Intel Xeon processor and the Nvidia Tesla K20 GPUusing CUDA show that our approach leads to impressive per-formance results It delivers up to 121063MLUPS throughputperformance and achieves up to 136-time speedup com-pared with a serial implementation running on single CPUcore

16 Scientific Programming

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research was supported by Next-Generation Infor-mation Computing Development Program through theNational Research Foundation of Korea (NRF) funded bythe Ministry of Education Science and Technology (NRF-2015M3C4A7065662) This work was supported by theHuman Resources Program in Energy Technology of theKorea Institute of Energy Technology Evaluation and Plan-ning (KETEP) granted financial resource from the Min-istry of Trade Industry amp Energy Republic of Korea (no20154030200770)

References

[1] J-P Rivet and J P Boon Lattice Gas Hydrodynamics vol 11Cambridge University Press Cambridge UK 2005

[2] J Tolke and M Krafczyk ldquoTeraFLOP computing on a desktopPC with GPUs for 3D CFDrdquo International Journal of Computa-tional Fluid Dynamics vol 22 no 7 pp 443ndash456 2008

[3] G Crimi FMantovaniM Pivanti S F Schifano and R Tripic-cione ldquoEarly experience on porting and running a Lattice Boltz-mann code on the Xeon-Phi co-processorrdquo in Proceedings of the13th Annual International Conference on Computational Science(ICCS rsquo13) vol 18 pp 551ndash560 Barcelona Spain June 2013

[4] M Sturmer J Gotz G Richter A Dorfler and U RudeldquoFluid flow simulation on the Cell Broadband Engine usingthe lattice Boltzmann methodrdquo Computers ampMathematics withApplications vol 58 no 5 pp 1062ndash1070 2009

[5] NVIDA CUDA Toolkit Documentation September 2015httpdocsnvidiacomcudaindexhtml

[6] GROUP KHRONOS OpenCL 2015 httpswwwkhronosorgopencl

[7] OpenACC-standardorg OpenACC March 2012 httpwwwopenaccorg

[8] J Habich T Zeiser G Hager and G Wellein ldquoPerformanceanalysis and optimization strategies for a D3Q19 lattice Boltz-mann kernel on nVIDIA GPUs using CUDArdquo Advances inEngineering Software vol 42 no 5 pp 266ndash272 2011

[9] J Tolke ldquoImplementation of a Lattice Boltzmann kernel usingthe compute unified device architecture developed by nVIDIArdquoComputing and Visualization in Science vol 13 no 1 pp 29ndash392010

[10] G Wellein T Zeiser G Hager and S Donath ldquoOn the singleprocessor performance of simple lattice Boltzmann kernelsrdquoComputers and Fluids vol 35 no 8-9 pp 910ndash919 2006

[11] MWittmann T Zeiser GHager andGWellein ldquoComparisonof different propagation steps for lattice Boltzmann methodsrdquoComputers andMathematics with Applications vol 65 no 6 pp924ndash935 2013

[12] M J Mawson and A J Revell ldquoMemory transfer optimizationfor a lattice Boltzmann solver on Kepler architecture nVidiaGPUsrdquo Computer Physics Communications vol 185 no 10 pp2566ndash2574 2014

[13] N Tran M Lee and D H Choi ldquoMemory-efficient par-allelization of 3D lattice boltzmann flow solver on a GPUrdquoin Proceedings of the IEEE 22nd International Conference onHigh Performance Computing (HiPC rsquo15) pp 315ndash324 IEEEBangalore India December 2015

[14] K Iglberger Cache Optimizations for the Lattice BoltzmannMethod in 3D vol 10 Lehrstuhl fur Informatik WurzburgGermany 2003

[15] J L Henning ldquoSPECCPU2006 Benchmark descriptionsrdquoACMSIGARCH Computer Architecture News vol 34 no 4 pp 1ndash172006

[16] P Bailey J Myre S D C Walsh D J Lilja and M O SaarldquoAccelerating lattice boltzmann fluid flow simulations usinggraphics processorsrdquo in Proceedings of the 38th InternationalConference on Parallel Processing (ICPP rsquo09) pp 550ndash557 IEEEVienna Austria September 2009

[17] P R Rinaldi E A DariM J Venere andA Clausse ldquoA Lattice-Boltzmann solver for 3D fluid simulation on GPUrdquo SimulationModelling Practice andTheory vol 25 pp 163ndash171 2012

[18] C Obrecht F Kuznik B Tourancheau and J-J Roux ldquoAnew approach to the lattice Boltzmann method for graphicsprocessing unitsrdquo Computers amp Mathematics with Applicationsvol 61 no 12 pp 3628ndash3638 2011

[19] M Astorino J B Sagredo and A Quarteroni ldquoA modularlattice boltzmann solver for GPU computing processorsrdquo SeMAJournal vol 59 no 1 pp 53ndash78 2012

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

Scientific Programming 11

Ghost layerBoundary

Others

Figure 12 Illustration of the lattice with a ldquoghost layerrdquo

(1) IF(cell type == FLUID)

(2) x = a

(3) ELSE

(4) x = b

Algorithm 7 Skeleton of IF-statement used in LBM kernel

(1) is fluid = (cell type == FLUID)

(2) x = a is fluid + b (is fluid)

Algorithm 8 Code with IF-statement removed

In order to avoid using 119868119865-119904119905119886119905119890119898119890119899119905119904 while streaming atthe boundary position a ldquoghost layerrdquo is attached in 119910- and119911-dimension If 119873119909 119873119910 and 119873119911 are the width height anddepth of the original grid 119873119873119909 = 119873119909 119873119873119910 = 119873119910 + 1 and119873119873119911 = 119873119911 + 1 are the new width height and depth of thegrid with the ghost layer (Figure 12) With the ghost layer wecan regard the computations at the boundary position as thenormal oneswithoutworrying about running out of the indexbound

As explained in Section 412 cells of the grid belong tothree types such as the regular fluid the acceleration cellsor the boundary The LBM kernel contains conditions todefine actions for each type of the cell The boundary celltype can be covered in the above-mentioned way using theghost layer This leads to the existence of the other twodifferent conditions in the same halfWARP of theGPUThusin order to remove 119868119865-119888119900119899119889119894119905119894119900119899119904 we combine conditionsinto computational statements Using this technique the 119868119865-119904119905119886119905119890119898119890119899119905 in Algorithm 7 is rewritten as in Algorithm 8

5 Experimental Results

In this section we first describe the experimental setupThenwe show the performance results with analyses

40

60

80

100

120

SerialAoS

0

20Perfo

rman

ce (M

LUPS

)

Domain size643 1283 1923 2563

Figure 13 Performance (MLUPS) comparison of the serial and theAoS with different domain sizes

51 Experimental Setup We implemented the LBM in thefollowing five ways

(i) Serial implementation using single CPU core (serial)using the source code from the SPEC CPU 2006470lbm [15] to make sure it is reasonably optimized

(ii) Parallel implementation on a GPU using the AoS datascheme (AoS)

(iii) Parallel implementation using the SoA data schemeand the push scheme (SoA Push Only)

(iv) Parallel implementation using the SoA data schemeand the pull scheme (SoA Pull Only)

(v) SoAusing pull schemewith our various optimizationsincluding the tiling with the data layout change(SoA Pull lowast)

We summarize our implementations in Table 2 We usedthe D3Q19 model for the LBM algorithm Domain grid sizesare scaled in the range of 643 1283 1923 and 2563 Thenumbers of time steps are 1000 5000 and 10000

In order to measure the performance of the LBM theMillion Lattice Updates Per Second (MLUPS) unit is usedwhich is calculated as follows

MLUPS = 119873119909 times 119873119910 times 119873119911 times 119873ts

106 times 119879 (11)

where119873119909119873119910 and119873119911 are domain sizes in the 119909- 119910- and 119911-dimension119873ts is the number of time steps used and 119879 is therun time of the simulation

Our experiments were conducted on a system incorpo-rating the Intel multicore processor (6-core 20Ghz IntelXeon E5-2650) with 20MB level-3 cache and Nvidia TeslaK20 GPU based on the Kepler architecture with 5GB devicememory The OS is CentOS 55 In order to validate theeffectiveness of our approach over the previous approacheswe have also conducted further experiments on anotherGPUNvidia GTX285 GPU

52 Results Using Previous Approaches The average perfor-mances of the serial and the AoS are shown in Figure 13


Table 2: Summary of experiments.

Experiment              Description
Serial                  Serial implementation on a single CPU core
AoS                     Parallel implementation with the AoS scheme
SoA_Push_Only           SoA scheme + push data scheme
SoA_Pull_Only           SoA scheme + pull data scheme
SoA_Pull_BR             SoA scheme + pull data scheme + branch divergence removal
SoA_Pull_RR             SoA scheme + pull data scheme + register usage reduction
SoA_Pull_Full           SoA scheme + pull data scheme + branch divergence removal + register usage reduction
SoA_Pull_Full_Tiling    SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA (SoA_Push_Only) with different domain sizes.

With the domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the serial implementation are 9.82 MLUPS, 7.42 MLUPS, 9.57 MLUPS, and 8.93 MLUPS. The MLUPS numbers for the AoS are 112.06 MLUPS, 78.69 MLUPS, 76.86 MLUPS, and 74.99 MLUPS, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme: the SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³. Note that in this experiment we applied only the SoA scheme, without any other optimization techniques.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either implementation. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for domain sizes 64³, 128³, 192³, and 256³, respectively. Thus the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively.

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

The number of global memory transactions observed shows that the total transactions (loads and stores) of the pull and push schemes are roughly equivalent. However, the number of store transactions of the pull scheme is 5.62% smaller than that of the push scheme. This explains the performance improvement of the pull scheme over the push scheme.
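The difference between the two schemes comes down to the direction of the neighbor offset in the streaming step. A sketch, using the cx/cy/cz velocity arrays shown earlier and a hypothetical idx() linearization helper:

    // Push: read the local cell, scatter to neighbors -> uncoalesced stores.
    f_new[idx(x + cx[q], y + cy[q], z + cz[q], q)] = f_post[idx(x, y, z, q)];
    // Pull: gather from neighbors, write the local cell -> uncoalesced loads,
    // but all stores stay aligned, which is cheaper on the GPU.
    f_new[idx(x, y, z, q)] = f_post[idx(x - cx[q], y - cy[q], z - cz[q], q)];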

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull_Only implementation.

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences in the kernel code, as explained in Section 4.4. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses, described in Section 4.3, improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing register uses for different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with the optimization techniques described in Sections 4.3 and 4.4, namely the branch divergence removal and the register usage reduction (SoA_Pull_Full), against the SoA_Pull_Only. The optimized SoA_Pull implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performance is better than the SoA_Pull_Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tiling size for the best performance is nx = 32, ny = 16, and nz = Nz/4 (see the launch sketch below).
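As an illustration, this tile shape maps naturally onto a CUDA launch configuration along the following lines; this is a sketch under our own naming and block-to-tile mapping, not the paper's code:

    // One thread block covers a 32 x 16 (nx x ny) tile of cells and walks a
    // z-column of depth Nz/4, so four blocks in z share each column.
    dim3 block(32, 16, 1);
    dim3 grid(Nx / 32, Ny / 16, 4);
    // inside the kernel:
    //   for (int z = blockIdx.z * (Nz/4); z < (blockIdx.z + 1) * (Nz/4); ++z) { ... }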

Figure 18: Performance (MLUPS) comparison of the SoA with and without the optimization techniques (SoA_Pull_Only versus SoA_Pull_Full) for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of four implementations (Serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA_Pull_Full_Tiling with domain size 256³, where a 136-fold speedup over the serial implementation is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were run on the same K20 GPU. Our approach performs better than [12] by 14% to 19%. Our approach incorporates


Table 3: Performance (MLUPS) comparisons of four implementations.

Domain size  Time steps  Serial  AoS     SoA_Pull_Only  SoA_Pull_Full_Tiling
64³          1000        9.89    111.73  759.52         1034
             5000        9.75    112.24  814.01         1115.63
             10000       9.82    111.73  818.36         1129.32
             Avg. perf.  9.82    112.41  797.30         1092.99
128³         1000        7.64    78.58   798.21         1115.2
             5000        7.65    78.74   855.55         1189.15
             10000       6.98    78.74   861.42         1199.33
             Avg. perf.  7.42    78.69   838.39         1167.69
192³         1000        9.47    76.79   811.35         1114.96
             5000        9.69    76.89   866.76         1185.95
             10000       9.56    76.91   871.39         1205.48
             Avg. perf.  9.57    76.86   849.83         1168.8
256³         1000        8.99    74.74   787.76         1113.77
             5000        8.91    75.09   873.8          1182.33
             10000       8.9     75.14   883.56         1210.63
             Avg. perf.  8.93    74.99   848.37         1168.91

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with previous work [12].

Domain size  Mawson and Revell (2014)  Our work
64³          914                       1129
128³         990                       1199
192³         1036                      1205
256³         1020                      1210

more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].

Figure 21: Performance (MLUPS) on the GTX 285 with different domain sizes (64³, 128³, 160³) for the eight implementations listed in Table 2.

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on the other GPU, the Nvidia GTX 285. Table 5 and Figure 21 show the average performance of our implementations with domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX 285.) As shown, our optimization technique SoA_Pull_Full_Tiling is better than the previous SoA_Pull_Only by up to 22.85%. Also, we obtained a 46-fold speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX 285 than on the K20.


Table 5: Performance (MLUPS) on the GTX 285.

Domain size  Serial  AoS    SoA_Push_Only  SoA_Pull_Only  SoA_Pull_BR  SoA_Pull_RR  SoA_Pull_Full  SoA_Pull_Full_Tiling
64³          9.82    45.22  240.97         257.36         268.45       283.24       296.73         328.87
128³         7.42    49.85  242.12         259.04         270.58       285.68       299.79         335.77
160³         9.57    50.15  237.58         254.18         266.25       279.59       294.26         328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. For the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated, as the sketch below illustrates. Thus most implementations of the LBM on the GPU use the SoA as the main data organization.
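Schematically, the two layouts differ as follows (our own declarations, for illustration):

    // AoS: the 19 distributions of one cell are contiguous, so consecutive
    // threads reading the same distribution access memory with a 19-float
    // stride -> poorly coalesced on a GPU.
    struct CellAoS { float f[19]; };
    CellAoS *grid_aos;      // accessed as grid_aos[cell].f[q]

    // SoA: all cells' values of one distribution are contiguous, so
    // consecutive threads access consecutive addresses -> fully coalesced.
    float *grid_soa[19];    // accessed as grid_soa[q][cell]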

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory: Tölke [9] used this approach and implemented the D2Q9 model, Habich et al. [8] followed the same approach for the D3Q19 model, and Bailey et al. [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme is used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving in the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers, which lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17]; this reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] a new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase; however, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag (see the sketch below); however, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.
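For context, such a register cap is typically imposed through the nvcc flag, or per kernel via a launch-bounds qualifier; the values below are illustrative only:

    // Whole-compilation cap (excess registers may spill to local memory):
    //   nvcc -maxrregcount=40 lbm.cu
    // Per-kernel alternative: request 256 threads/block with at least 4
    // resident blocks per SM, letting the compiler pick the register budget.
    __global__ void __launch_bounds__(256, 4) collide_stream(/* ... */);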

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on a single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure of the LBM kernels and improving the thread parallelism available at run time, we developed techniques for aggressively reducing the register uses of the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU-friendly data organization scheme (the SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.0 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-fold speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tölke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Stürmer, J. Götz, G. Richter, A. Dörfler, and U. Rüde, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org/.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tölke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl für Informatik, Würzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Vénere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.


Page 12: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

12 Scientific Programming

Table 2 Summary of experiments

Experiment DescriptionSerial Serial implementation on single CPU coreParallelAoS AoS schemeSoA Push Only SoA scheme + push data schemeSoA Pull

SoA Pull Only SoA scheme + pull data schemeSoA Pull BR SoA scheme + pull data scheme + branch divergence removalSoA Pull RR SoA scheme + pull data scheme + register reductionSoA Pull Full SoA scheme + pull data scheme + branch divergence removal + register usage reduction

SoA Pull Full Tiling SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with datalayout change

Perfo

rman

ce (M

LUPS

)

400500600700800900

0100200300

SoA_Push_OnlyAoS

Domain size643 1283 1923 2563

Figure 14 Performance (MLUPS) comparison of the AoS and theSoA with different domain sizes

With various domain sizes of 643 1283 1923 and 2563 theMLUPS numbers for the serial are 982MLUPS 742MLUPS957 MLUPS and 893 MLUPS The MLUPS for the AoSare 11206 MLUPS 7869 MLUPS 7686 MLUPS and 7499MLUPS respectively With these numbers as the baseline wealsomeasured the SoAperformance for various domain sizesFigure 14 shows that the SoA significantly outperforms theAoS scheme The SoA is faster than the AoS by 663 9911028 and 1049 times for the domain sizes 643 1283 1923and 2563 Note that in this experiment we applied only theSoA scheme without any other optimization techniques

Figure 15 compares the performance of the pull schemeand the push scheme For fair comparison we did notapply any other optimization techniques to both of theimplementationsThe pull scheme performs at 7973MLUPS8384 MLUPS 8498 MLUPS and 84837 MLUPS whereasthe push scheme performs at 7434 MLUPS 780 MLUPS79016 MLUPS and 78713 MLUPS for domain sizes 6431283 1923 and 2563 respectively Thus the pull scheme isbetter than the push scheme by 675 697 702 and72 respectivelyThenumber of globalmemory transactions

760780800820840860

Perfo

rman

ce (M

LUPS

)

680700720740

SoA_Pull_OnlySoA_Push_Only

Domain size643 1283 1923 2563

Figure 15 Performance (MLUPS) comparison of the SoA usingpush scheme and pull scheme with different domain sizes

observed shows that the total transactions (loads and stores)of the pull and push schemes are quite equivalent Howeverthe number of store transactions of the pull scheme is 562smaller than the push scheme This leads to the performanceimprovement of the pull scheme compared with the pushscheme

53 Results Using Our Optimization Techniques In this sub-section we show the performance improvements of our opti-mization techniques compared with the previous approachbased on the SoA Pull implementation

(i) Figure 16 compares the average performance of theSoA with and without removing the branch diver-gences explained in Section 44 in the kernel codeRemoving the branch divergence improves the per-formance by 437 445 469 and 519 fordomain sizes 643 1283 1923 and 2563 respectively

(ii) Reducing the register uses described in Section 43improves the performance by 1207 1244 1198and 1258 as Figure 17 shows

Scientific Programming 13Pe

rform

ance

(MLU

PS)

820840860880900920

740760780800

SoA_Pull_OnlySoA_Pull_BR

Domain size643 1283 1923 2563

Figure 16 Performance (MLUPS) comparison of the SoA with andwithout branch removal for different domain sizes

750

800

850

900

950

1000

Perfo

rman

ce (M

LUPS

)

600

650

700

750

SoA_Pull_OnlySoA_Pull_RR

Domain size643 1283 1923 2563

Figure 17 Performance (MLUPS) comparison of the SoA with andwithout reducing register uses for different domain sizes

(iii) Figure 18 compares the performance of the SoA usingthe pull scheme with optimization techniques suchas the branch divergence removal and the registerusage reduction described in Sections 43 and 44(SoA Pull Full) and the SoA Pull Only The opti-mized SoA Pull implementation is better than theSoA Pull Only by 1644 1689 1668 and 1777for the domain sizes 643 1283 1923 and 2563respectively

(iv) Figure 19 shows the performance comparison ofthe SoA Pull Full and SoA Pull Full Tiling TheSoA Pull Full Tiling performance is better than theSoA Pull Full from 1178 to 136 The domainsize 1283 gives the best performance improvement of136 while the domain size 2563 gives the lowestimprovement of 1178 The experimental resultsshow that the tiling size for the best performance is119899119909 = 32 119899119910 = 16 and 119899119911 = 119873119911 divide 4

600

800

1000

1200

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_OnlySoA_Pull_Full

Domain size643 1283 1923 2563

Figure 18 Performance (MLUPS) comparison of the SoA with andwithout optimization techniques for different domain sizes

600

800

1000

1200

1400

0

200

400

Perfo

rman

ce (M

LUP)

SoA_Pull_Full_TilingSoA_Pull_Full

Domain size643 1283 1923 2563

Figure 19 Performance (MLUPS) comparison of the SoA Pull Fulland the SoA Pull Full Tiling with different domain sizes

(v) Figure 20 presents the overall performance of theSoA Pull Full Tiling implementation compared withthe SoA Pull Only With all our optimization tech-niques described in Sections 42 43 and 44 weobtained 28 overall performance improvementscompared with the previous approach

(vi) Table 3 compares the performance of four im-plementations (serial AoS SoA Pull Only andSoA Pull Full Tiling) with different domain sizesAs shown the peak performance of 121063 MLUPSis achieved by the SoA Pull Full Tiling with domainsize 2563 where the speedup of 136 is also achieved

(vii) Table 4 compares the performance of our work withthe previous work conducted by Mawson and Revell[12] Both implementations were conducted on thesame K20 GPU Our approach performs better than[12] from 14 to 19 Our approach incorporates

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210

more optimization techniques such as the tiling opti-mization with the data layout change the branchdivergence removal among others compared with[12]

SerialAoSSoA_Push_OnlySoA_Pull_Only

SoA_Pull_RRSoA_Pull_BRSoA_Pull_FullSoA_Pull_Full_Tiling

Domain size643 1283 1603

Perfo

rman

ce (M

LUPS

)

400

350

300

250

200

150

100

50

0

Figure 21 Performance (MLUPS) on GTX285 with differentdomain sizes

(viii) In order to validate the effectiveness of our approachwe conducted more experiments on the other GPUNvidia GTX285 Table 5 and Figure 21 show theaverage performance of our implementations withdomain sizes 643 1283 and 1603 (The grid sizeslarger than 1603 cannot be accommodated in thedevice memory of the GTX 285) As shown ouroptimization technique SoA Pull Full Tiling is bet-ter than the previous SoA Pull Only up to 2285Alsowe obtained 46-time speedup comparedwith theserial implementation The level of the performanceimprovement and the speedup are however lower onthe GTX 285 compared with the K20

Scientific Programming 15

Table 5 Performance (MLUPS) on GTX285

Domain sizes Serial AoS SoA Push Only SoA Pull Only SoA Pull BR SoA Pull RR SoA Pull Full SoA Pull Full Tiling643 982 4522 24097 25736 26845 28324 29673 328871283 742 4985 24212 25904 27058 28568 29979 335771603 957 5015 23758 25418 26625 27959 29426 32862

6 Previous Research

Previous parallelization approaches for the LBM algorithmfocused on two main issues how to efficiently organize thedata and how to avoid the misalignment in the streaming(propagation) phase of the LBM In the data organizationthe AoS and the SoA are two most commonly used schemesWhile AoS scheme is suitable for the CPU architecture SoAis a better scheme for the GPU architecture when the globalmemory access coalition technique is incorporated Thusmost implementations of the LBM on the GPU use the SoAas the main data organization

In order to avoid the misalignment in the streamingphase of the LBM there are two main approaches The firstproposed approach uses the shared memory Tolke in [9]used the approach and implemented the D2Q9 modelHabich et al [8] followed the same approach for the D3Q19model Bailey et al in [16] also used the shared memory toachieve 100 coalescence in the propagation phase for theD3Q13 model In the second approach the pull scheme wasused instead of the push scheme without using the sharedmemory

As observed the main aim of using the sharedmemory isto avoid the misaligned accesses caused by the distributionvalues moving to the east and west directions [8 9 17]However the shared memory implementation needs extrasynchronizations and intermediate registers This lowers theachieved bandwidth In addition using the shared memorylimits the maximum number of threads per thread blockbecause of the limited size of the shared memory [17] whichreduces the number of activeWARPs (occupancy) of the ker-nels thereby hurting the performance Using the pull schemeinstead there is no extra synchronization cost incurredand no intermediate registers are needed In addition thebetter utilization of the registers in the pull scheme leads togenerating a larger number of threads as the total numberof registers is fixed This leads to better utilization of theGPUrsquos multithreading capability and higher performanceLatest results in [12] confirm the higher performance ofthe pull scheme compared with using the shared mem-ory

Besides the above approaches in [12] the new featureof the Tesla K20 GPU shuffle instruction was applied toavoid the misalignment in the streaming phase However theobtained results were worse In [18] Obrecht et al focused onchoosing careful data transfer schemes in the global memoryinstead of using the shared memory in order to solve themisaligned memory access problem

There were some approaches to maximize the GPUmultiprocessor occupancy by reducing the register uses perthread Bailey et al in [16] showed 20 improvement in

maximum performance compared with the D3Q19 model in[8]They set the number of registers used by the kernel belowa certain limit using the Nvidia compiler flag However thisapproach may spill the register data to the local memoryHabich et al [8] suggested a method to reduce the number ofregisters by using the base index which forces the compilerto reuse the same register again

A few different implementations of the LBM were at-tempted Astorino et al [19] built a GPU implementationframework for the LBM valid for the two- and three-dimensional problemsThe framework is organized in amod-ular fashion and allows for easy modification They used theSoA scheme and the semidirect approach as the addressingschemeThey also adopted the swapping technique to save thememory required for the LBM implementation Rinaldi et al[17] suggested an approach based on the single-step algorithmwith a reversed collision-propagation scheme They used theshared memory as the main computational memory insteadof the global memory In our implementation we adoptedthese approaches for the SoA Pull Only implementationshown in Section 5

7 Conclusion

In this paper we developed high performance paralleliza-tion of the LBM algorithm with the D3Q19 model on theGPU In order to improve the cache locality and minimizethe overheads associated with the uncoalesced accesses inmoving the data to the adjacent cells in the streamingphase of the LBM we used the tiling optimization withthe data layout change For reducing the high registerpressure for the LBM kernels and improving the availablethread parallelism generated at the run time we developedtechniques for aggressively reducing the register uses forthe kernels We also developed optimization techniquesfor removing the branch divergence Other already-knowntechniques were also adopted in our parallel implementationsuch as combining the streaming phase and the collisionphase into one phase to reduce the memory overhead aGPU friendly data organization scheme so-called the SoAscheme efficient data placement of the major data structuresin the GPU memory hierarchy and adopting a data updatescheme (pull scheme) to further reduce the overheads ofthe uncoalesced accesses Experimental results on the 6-core22 Ghz Intel Xeon processor and the Nvidia Tesla K20 GPUusing CUDA show that our approach leads to impressive per-formance results It delivers up to 121063MLUPS throughputperformance and achieves up to 136-time speedup com-pared with a serial implementation running on single CPUcore

16 Scientific Programming

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research was supported by Next-Generation Infor-mation Computing Development Program through theNational Research Foundation of Korea (NRF) funded bythe Ministry of Education Science and Technology (NRF-2015M3C4A7065662) This work was supported by theHuman Resources Program in Energy Technology of theKorea Institute of Energy Technology Evaluation and Plan-ning (KETEP) granted financial resource from the Min-istry of Trade Industry amp Energy Republic of Korea (no20154030200770)

References

[1] J-P Rivet and J P Boon Lattice Gas Hydrodynamics vol 11Cambridge University Press Cambridge UK 2005

[2] J Tolke and M Krafczyk ldquoTeraFLOP computing on a desktopPC with GPUs for 3D CFDrdquo International Journal of Computa-tional Fluid Dynamics vol 22 no 7 pp 443ndash456 2008

[3] G Crimi FMantovaniM Pivanti S F Schifano and R Tripic-cione ldquoEarly experience on porting and running a Lattice Boltz-mann code on the Xeon-Phi co-processorrdquo in Proceedings of the13th Annual International Conference on Computational Science(ICCS rsquo13) vol 18 pp 551ndash560 Barcelona Spain June 2013

[4] M Sturmer J Gotz G Richter A Dorfler and U RudeldquoFluid flow simulation on the Cell Broadband Engine usingthe lattice Boltzmann methodrdquo Computers ampMathematics withApplications vol 58 no 5 pp 1062ndash1070 2009

[5] NVIDA CUDA Toolkit Documentation September 2015httpdocsnvidiacomcudaindexhtml

[6] GROUP KHRONOS OpenCL 2015 httpswwwkhronosorgopencl

[7] OpenACC-standardorg OpenACC March 2012 httpwwwopenaccorg

[8] J Habich T Zeiser G Hager and G Wellein ldquoPerformanceanalysis and optimization strategies for a D3Q19 lattice Boltz-mann kernel on nVIDIA GPUs using CUDArdquo Advances inEngineering Software vol 42 no 5 pp 266ndash272 2011

[9] J Tolke ldquoImplementation of a Lattice Boltzmann kernel usingthe compute unified device architecture developed by nVIDIArdquoComputing and Visualization in Science vol 13 no 1 pp 29ndash392010

[10] G Wellein T Zeiser G Hager and S Donath ldquoOn the singleprocessor performance of simple lattice Boltzmann kernelsrdquoComputers and Fluids vol 35 no 8-9 pp 910ndash919 2006

[11] MWittmann T Zeiser GHager andGWellein ldquoComparisonof different propagation steps for lattice Boltzmann methodsrdquoComputers andMathematics with Applications vol 65 no 6 pp924ndash935 2013

[12] M J Mawson and A J Revell ldquoMemory transfer optimizationfor a lattice Boltzmann solver on Kepler architecture nVidiaGPUsrdquo Computer Physics Communications vol 185 no 10 pp2566ndash2574 2014

[13] N Tran M Lee and D H Choi ldquoMemory-efficient par-allelization of 3D lattice boltzmann flow solver on a GPUrdquoin Proceedings of the IEEE 22nd International Conference onHigh Performance Computing (HiPC rsquo15) pp 315ndash324 IEEEBangalore India December 2015

[14] K Iglberger Cache Optimizations for the Lattice BoltzmannMethod in 3D vol 10 Lehrstuhl fur Informatik WurzburgGermany 2003

[15] J L Henning ldquoSPECCPU2006 Benchmark descriptionsrdquoACMSIGARCH Computer Architecture News vol 34 no 4 pp 1ndash172006

[16] P Bailey J Myre S D C Walsh D J Lilja and M O SaarldquoAccelerating lattice boltzmann fluid flow simulations usinggraphics processorsrdquo in Proceedings of the 38th InternationalConference on Parallel Processing (ICPP rsquo09) pp 550ndash557 IEEEVienna Austria September 2009

[17] P R Rinaldi E A DariM J Venere andA Clausse ldquoA Lattice-Boltzmann solver for 3D fluid simulation on GPUrdquo SimulationModelling Practice andTheory vol 25 pp 163ndash171 2012

[18] C Obrecht F Kuznik B Tourancheau and J-J Roux ldquoAnew approach to the lattice Boltzmann method for graphicsprocessing unitsrdquo Computers amp Mathematics with Applicationsvol 61 no 12 pp 3628ndash3638 2011

[19] M Astorino J B Sagredo and A Quarteroni ldquoA modularlattice boltzmann solver for GPU computing processorsrdquo SeMAJournal vol 59 no 1 pp 53ndash78 2012

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

Scientific Programming 13Pe

rform

ance

(MLU

PS)

820840860880900920

740760780800

SoA_Pull_OnlySoA_Pull_BR

Domain size643 1283 1923 2563

Figure 16 Performance (MLUPS) comparison of the SoA with andwithout branch removal for different domain sizes

750

800

850

900

950

1000

Perfo

rman

ce (M

LUPS

)

600

650

700

750

SoA_Pull_OnlySoA_Pull_RR

Domain size643 1283 1923 2563

Figure 17 Performance (MLUPS) comparison of the SoA with andwithout reducing register uses for different domain sizes

(iii) Figure 18 compares the performance of the SoA usingthe pull scheme with optimization techniques suchas the branch divergence removal and the registerusage reduction described in Sections 43 and 44(SoA Pull Full) and the SoA Pull Only The opti-mized SoA Pull implementation is better than theSoA Pull Only by 1644 1689 1668 and 1777for the domain sizes 643 1283 1923 and 2563respectively

(iv) Figure 19 shows the performance comparison ofthe SoA Pull Full and SoA Pull Full Tiling TheSoA Pull Full Tiling performance is better than theSoA Pull Full from 1178 to 136 The domainsize 1283 gives the best performance improvement of136 while the domain size 2563 gives the lowestimprovement of 1178 The experimental resultsshow that the tiling size for the best performance is119899119909 = 32 119899119910 = 16 and 119899119911 = 119873119911 divide 4

600

800

1000

1200

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_OnlySoA_Pull_Full

Domain size643 1283 1923 2563

Figure 18 Performance (MLUPS) comparison of the SoA with andwithout optimization techniques for different domain sizes

600

800

1000

1200

1400

0

200

400

Perfo

rman

ce (M

LUP)

SoA_Pull_Full_TilingSoA_Pull_Full

Domain size643 1283 1923 2563

Figure 19 Performance (MLUPS) comparison of the SoA Pull Fulland the SoA Pull Full Tiling with different domain sizes

(v) Figure 20 presents the overall performance of theSoA Pull Full Tiling implementation compared withthe SoA Pull Only With all our optimization tech-niques described in Sections 42 43 and 44 weobtained 28 overall performance improvementscompared with the previous approach

(vi) Table 3 compares the performance of four im-plementations (serial AoS SoA Pull Only andSoA Pull Full Tiling) with different domain sizesAs shown the peak performance of 121063 MLUPSis achieved by the SoA Pull Full Tiling with domainsize 2563 where the speedup of 136 is also achieved

(vii) Table 4 compares the performance of our work withthe previous work conducted by Mawson and Revell[12] Both implementations were conducted on thesame K20 GPU Our approach performs better than[12] from 14 to 19 Our approach incorporates

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210

more optimization techniques such as the tiling opti-mization with the data layout change the branchdivergence removal among others compared with[12]

SerialAoSSoA_Push_OnlySoA_Pull_Only

SoA_Pull_RRSoA_Pull_BRSoA_Pull_FullSoA_Pull_Full_Tiling

Domain size643 1283 1603

Perfo

rman

ce (M

LUPS

)

400

350

300

250

200

150

100

50

0

Figure 21 Performance (MLUPS) on GTX285 with differentdomain sizes

(viii) In order to validate the effectiveness of our approachwe conducted more experiments on the other GPUNvidia GTX285 Table 5 and Figure 21 show theaverage performance of our implementations withdomain sizes 643 1283 and 1603 (The grid sizeslarger than 1603 cannot be accommodated in thedevice memory of the GTX 285) As shown ouroptimization technique SoA Pull Full Tiling is bet-ter than the previous SoA Pull Only up to 2285Alsowe obtained 46-time speedup comparedwith theserial implementation The level of the performanceimprovement and the speedup are however lower onthe GTX 285 compared with the K20

Scientific Programming 15

Table 5 Performance (MLUPS) on GTX285

Domain sizes Serial AoS SoA Push Only SoA Pull Only SoA Pull BR SoA Pull RR SoA Pull Full SoA Pull Full Tiling643 982 4522 24097 25736 26845 28324 29673 328871283 742 4985 24212 25904 27058 28568 29979 335771603 957 5015 23758 25418 26625 27959 29426 32862

6 Previous Research

Previous parallelization approaches for the LBM algorithmfocused on two main issues how to efficiently organize thedata and how to avoid the misalignment in the streaming(propagation) phase of the LBM In the data organizationthe AoS and the SoA are two most commonly used schemesWhile AoS scheme is suitable for the CPU architecture SoAis a better scheme for the GPU architecture when the globalmemory access coalition technique is incorporated Thusmost implementations of the LBM on the GPU use the SoAas the main data organization

In order to avoid the misalignment in the streamingphase of the LBM there are two main approaches The firstproposed approach uses the shared memory Tolke in [9]used the approach and implemented the D2Q9 modelHabich et al [8] followed the same approach for the D3Q19model Bailey et al in [16] also used the shared memory toachieve 100 coalescence in the propagation phase for theD3Q13 model In the second approach the pull scheme wasused instead of the push scheme without using the sharedmemory

As observed the main aim of using the sharedmemory isto avoid the misaligned accesses caused by the distributionvalues moving to the east and west directions [8 9 17]However the shared memory implementation needs extrasynchronizations and intermediate registers This lowers theachieved bandwidth In addition using the shared memorylimits the maximum number of threads per thread blockbecause of the limited size of the shared memory [17] whichreduces the number of activeWARPs (occupancy) of the ker-nels thereby hurting the performance Using the pull schemeinstead there is no extra synchronization cost incurredand no intermediate registers are needed In addition thebetter utilization of the registers in the pull scheme leads togenerating a larger number of threads as the total numberof registers is fixed This leads to better utilization of theGPUrsquos multithreading capability and higher performanceLatest results in [12] confirm the higher performance ofthe pull scheme compared with using the shared mem-ory

Besides the above approaches in [12] the new featureof the Tesla K20 GPU shuffle instruction was applied toavoid the misalignment in the streaming phase However theobtained results were worse In [18] Obrecht et al focused onchoosing careful data transfer schemes in the global memoryinstead of using the shared memory in order to solve themisaligned memory access problem

There were some approaches to maximize the GPUmultiprocessor occupancy by reducing the register uses perthread Bailey et al in [16] showed 20 improvement in

maximum performance compared with the D3Q19 model in[8]They set the number of registers used by the kernel belowa certain limit using the Nvidia compiler flag However thisapproach may spill the register data to the local memoryHabich et al [8] suggested a method to reduce the number ofregisters by using the base index which forces the compilerto reuse the same register again

A few different implementations of the LBM were at-tempted Astorino et al [19] built a GPU implementationframework for the LBM valid for the two- and three-dimensional problemsThe framework is organized in amod-ular fashion and allows for easy modification They used theSoA scheme and the semidirect approach as the addressingschemeThey also adopted the swapping technique to save thememory required for the LBM implementation Rinaldi et al[17] suggested an approach based on the single-step algorithmwith a reversed collision-propagation scheme They used theshared memory as the main computational memory insteadof the global memory In our implementation we adoptedthese approaches for the SoA Pull Only implementationshown in Section 5

7 Conclusion

In this paper we developed high performance paralleliza-tion of the LBM algorithm with the D3Q19 model on theGPU In order to improve the cache locality and minimizethe overheads associated with the uncoalesced accesses inmoving the data to the adjacent cells in the streamingphase of the LBM we used the tiling optimization withthe data layout change For reducing the high registerpressure for the LBM kernels and improving the availablethread parallelism generated at the run time we developedtechniques for aggressively reducing the register uses forthe kernels We also developed optimization techniquesfor removing the branch divergence Other already-knowntechniques were also adopted in our parallel implementationsuch as combining the streaming phase and the collisionphase into one phase to reduce the memory overhead aGPU friendly data organization scheme so-called the SoAscheme efficient data placement of the major data structuresin the GPU memory hierarchy and adopting a data updatescheme (pull scheme) to further reduce the overheads ofthe uncoalesced accesses Experimental results on the 6-core22 Ghz Intel Xeon processor and the Nvidia Tesla K20 GPUusing CUDA show that our approach leads to impressive per-formance results It delivers up to 121063MLUPS throughputperformance and achieves up to 136-time speedup com-pared with a serial implementation running on single CPUcore

16 Scientific Programming

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research was supported by Next-Generation Infor-mation Computing Development Program through theNational Research Foundation of Korea (NRF) funded bythe Ministry of Education Science and Technology (NRF-2015M3C4A7065662) This work was supported by theHuman Resources Program in Energy Technology of theKorea Institute of Energy Technology Evaluation and Plan-ning (KETEP) granted financial resource from the Min-istry of Trade Industry amp Energy Republic of Korea (no20154030200770)

References

[1] J-P Rivet and J P Boon Lattice Gas Hydrodynamics vol 11Cambridge University Press Cambridge UK 2005

[2] J Tolke and M Krafczyk ldquoTeraFLOP computing on a desktopPC with GPUs for 3D CFDrdquo International Journal of Computa-tional Fluid Dynamics vol 22 no 7 pp 443ndash456 2008

[3] G Crimi FMantovaniM Pivanti S F Schifano and R Tripic-cione ldquoEarly experience on porting and running a Lattice Boltz-mann code on the Xeon-Phi co-processorrdquo in Proceedings of the13th Annual International Conference on Computational Science(ICCS rsquo13) vol 18 pp 551ndash560 Barcelona Spain June 2013

[4] M Sturmer J Gotz G Richter A Dorfler and U RudeldquoFluid flow simulation on the Cell Broadband Engine usingthe lattice Boltzmann methodrdquo Computers ampMathematics withApplications vol 58 no 5 pp 1062ndash1070 2009

[5] NVIDA CUDA Toolkit Documentation September 2015httpdocsnvidiacomcudaindexhtml

[6] GROUP KHRONOS OpenCL 2015 httpswwwkhronosorgopencl

[7] OpenACC-standardorg OpenACC March 2012 httpwwwopenaccorg

[8] J Habich T Zeiser G Hager and G Wellein ldquoPerformanceanalysis and optimization strategies for a D3Q19 lattice Boltz-mann kernel on nVIDIA GPUs using CUDArdquo Advances inEngineering Software vol 42 no 5 pp 266ndash272 2011

[9] J Tolke ldquoImplementation of a Lattice Boltzmann kernel usingthe compute unified device architecture developed by nVIDIArdquoComputing and Visualization in Science vol 13 no 1 pp 29ndash392010

[10] G Wellein T Zeiser G Hager and S Donath ldquoOn the singleprocessor performance of simple lattice Boltzmann kernelsrdquoComputers and Fluids vol 35 no 8-9 pp 910ndash919 2006

[11] MWittmann T Zeiser GHager andGWellein ldquoComparisonof different propagation steps for lattice Boltzmann methodsrdquoComputers andMathematics with Applications vol 65 no 6 pp924ndash935 2013

[12] M J Mawson and A J Revell ldquoMemory transfer optimizationfor a lattice Boltzmann solver on Kepler architecture nVidiaGPUsrdquo Computer Physics Communications vol 185 no 10 pp2566ndash2574 2014

[13] N Tran M Lee and D H Choi ldquoMemory-efficient par-allelization of 3D lattice boltzmann flow solver on a GPUrdquoin Proceedings of the IEEE 22nd International Conference onHigh Performance Computing (HiPC rsquo15) pp 315ndash324 IEEEBangalore India December 2015

[14] K Iglberger Cache Optimizations for the Lattice BoltzmannMethod in 3D vol 10 Lehrstuhl fur Informatik WurzburgGermany 2003

[15] J L Henning ldquoSPECCPU2006 Benchmark descriptionsrdquoACMSIGARCH Computer Architecture News vol 34 no 4 pp 1ndash172006

[16] P Bailey J Myre S D C Walsh D J Lilja and M O SaarldquoAccelerating lattice boltzmann fluid flow simulations usinggraphics processorsrdquo in Proceedings of the 38th InternationalConference on Parallel Processing (ICPP rsquo09) pp 550ndash557 IEEEVienna Austria September 2009

[17] P R Rinaldi E A DariM J Venere andA Clausse ldquoA Lattice-Boltzmann solver for 3D fluid simulation on GPUrdquo SimulationModelling Practice andTheory vol 25 pp 163ndash171 2012

[18] C Obrecht F Kuznik B Tourancheau and J-J Roux ldquoAnew approach to the lattice Boltzmann method for graphicsprocessing unitsrdquo Computers amp Mathematics with Applicationsvol 61 no 12 pp 3628ndash3638 2011

[19] M Astorino J B Sagredo and A Quarteroni ldquoA modularlattice boltzmann solver for GPU computing processorsrdquo SeMAJournal vol 59 no 1 pp 53ndash78 2012

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Performance Optimization of 3D Lattice ...ants.mju.ac.kr/publication/Scientific Programming-2017.pdf · Inthefunction LBM ,thecollide functionand stream function

14 Scientific Programming

Table 3 Performance (MLUPS) comparisons of four implementations

Domain sizes TimeSteps Serial AoS SoA Pull Only SoA Pull Full Tiling

6431000 989 11173 75952 10345000 975 11224 81401 11156310000 982 11173 81836 112932

Avg Perf 982 11241 79730 109299

12831000 764 7858 79821 111525000 765 7874 85555 11891510000 698 7874 86142 119933

Avg Perf 742 7869 83839 116769

19231000 947 7679 81135 1114965000 969 7689 86676 11859510000 956 7691 87139 120548

Avg Perf 957 7686 84983 11688

25631000 899 7474 78776 1113775000 891 7509 8738 11823310000 89 77514 88356 121063

Avg Perf 893 7499 84837 116891

600

800

1000

1200

1400

Perfo

rman

ce (M

LUPS

)

0

200

400

SoA_Pull_Full_TilingSoA_Pull_Only

Domain size643 1283 1923 2563

Figure 20 Performance (MLUPS) comparison of theSoA Pull Only and the SoA Pull Tiling with different domainsizes

Table 4 Performance (MLUPS) comparison of our work withprevious work [12]

Domain sizes Mawson and Revell (2014) Our work643 914 11291283 990 11991923 1036 12052563 1020 1210

more optimization techniques such as the tiling opti-mization with the data layout change the branchdivergence removal among others compared with[12]

SerialAoSSoA_Push_OnlySoA_Pull_Only

SoA_Pull_RRSoA_Pull_BRSoA_Pull_FullSoA_Pull_Full_Tiling

Domain size643 1283 1603

Perfo

rman

ce (M

LUPS

)

400

350

300

250

200

150

100

50

0

Figure 21 Performance (MLUPS) on GTX285 with differentdomain sizes

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimization technique SoA Pull Full Tiling is better than the previous SoA Pull Only by up to 22.85%. Also, we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX285 compared with the K20.


Table 5: Performance (MLUPS) on GTX285.

Domain size   Serial   AoS     SoA_Push_Only   SoA_Pull_Only   SoA_Pull_BR   SoA_Pull_RR   SoA_Pull_Full   SoA_Pull_Full_Tiling
64³           9.82     45.22   240.97          257.36          268.45        283.24        296.73          328.87
128³          7.42     49.85   242.12          259.04          270.58        285.68        299.79          335.77
160³          9.57     50.15   237.58          254.18          266.25        279.59        294.26          328.62

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus most implementations of the LBM on the GPU use the SoA as the main data organization.
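To make the layout difference concrete, here is a minimal sketch of the two schemes for the 19 distribution values of a D3Q19 cell (type and field names are ours, not from any of the cited implementations):

    // AoS: the 19 distributions of one cell sit next to each other, so when
    // consecutive threads each load direction d of their own cell, the
    // accesses are 19 floats apart and cannot coalesce into one transaction.
    struct CellAoS { float f[19]; };      // gridAoS[cell].f[d]

    // SoA: one contiguous array per direction; consecutive threads touching
    // direction d of consecutive cells read consecutive addresses, which the
    // GPU coalesces into a single wide memory transaction.
    struct GridSoA { float *f[19]; };     // gridSoA.f[d][cell]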

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first proposed approach uses the shared memory. Tolke in [9] used the approach and implemented the D2Q9 model. Habich et al. [8] followed the same approach for the D3Q19 model. Bailey et al. in [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme was used instead of the push scheme, without using the shared memory.

As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers. This lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. Latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.
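The gather-style access pattern the pull scheme relies on can be sketched as follows. This is an illustration under our own naming and direction ordering, not the paper's kernel; periodic boundaries are assumed for brevity:

    // D3Q19 lattice velocities (ordering is illustrative).
    __constant__ int cx[19] = {0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0};
    __constant__ int cy[19] = {0, 0, 0, 1,-1, 0, 0, 1, 1,-1,-1, 0, 0, 0, 0, 1,-1, 1,-1};
    __constant__ int cz[19] = {0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1, 1,-1,-1, 1, 1,-1,-1};

    // Pull-scheme streaming: each thread GATHERS the value that streams
    // into its cell, so the store to f_dst is perfectly coalesced; only
    // the loads may be misaligned, which costs far less than the
    // scattered stores of the push scheme.
    __global__ void streamPull(const float *f_src, float *f_dst,
                               int nx, int ny, int nz)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y, z = blockIdx.z;
        if (x >= nx) return;

        int nCells = nx * ny * nz;
        int cell   = (z * ny + y) * nx + x;

        for (int d = 0; d < 19; ++d) {
            // The value arriving at (x,y,z) along direction d left the
            // neighbor at (x,y,z) - c_d in the previous time step.
            int xs = (x - cx[d] + nx) % nx;   // periodic wrap for brevity
            int ys = (y - cy[d] + ny) % ny;
            int zs = (z - cz[d] + nz) % nz;
            f_dst[d * nCells + cell] =
                f_src[d * nCells + ((zs * ny + ys) * nx + xs)];
        }
        // In a fused implementation the collision step would follow here.
    }

A launch of the form streamPull<<<dim3((nx + 127) / 128, ny, nz), 128>>>(...) maps one thread per cell, with the x-dimension of the thread block aligned to the fastest-varying array index so the writes coalesce.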

Besides the above approaches, in [12] the new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.
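For context, the idea behind the shuffle approach is that the lanes of a warp can exchange the east/west-moving values directly through registers, keeping the subsequent global store aligned. A minimal sketch of the mechanism (not the code of [12]; __shfl_up_sync is the current CUDA spelling of the Kepler-era __shfl intrinsic):

    // Lane i receives the east-moving distribution held by lane i-1, so
    // the following global write stays aligned to the warp. Lane 0 must
    // still fetch its value across the warp/block boundary, which is one
    // reason the extra bookkeeping can eat up the savings.
    __device__ float shiftEastWithinWarp(float fEast)
    {
        return __shfl_up_sync(0xffffffffu, fEast, 1);
    }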

There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.
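The compiler flag referred to is presumably nvcc's -maxrregcount; an equivalent per-kernel control is the __launch_bounds__ qualifier. A minimal sketch (the kernel name and numbers are ours):

    // File-wide cap from the command line:
    //   nvcc -O3 -maxrregcount=64 lbm.cu
    // Per-kernel alternative: promise at most 128 threads per block and
    // ask for at least 8 resident blocks per SM; on Kepler's 64K-register
    // SMs this budgets 65536 / (128 * 8) = 64 registers per thread,
    // beyond which the compiler spills to local memory.
    __global__ void __launch_bounds__(128, 8)
    collideStreamKernel(const float *f_src, float *f_dst)
    {
        // ... kernel body ...
    }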

A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for the two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on the single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA Pull Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure for the LBM kernels and improving the available thread parallelism generated at the run time, we developed techniques for aggressively reducing the register uses for the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.2 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results. It delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.

[2] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443–456, 2008.

[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551–560, Barcelona, Spain, June 2013.

[4] M. Sturmer, J. Gotz, G. Richter, A. Dorfler, and U. Rude, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062–1070, 2009.

[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.

[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.

[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org/.

[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266–272, 2011.

[9] J. Tolke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29–39, 2010.

[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910–919, 2006.

[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924–935, 2013.

[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566–2574, 2014.

[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315–324, IEEE, Bangalore, India, December 2015.

[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl fur Informatik, Wurzburg, Germany, 2003.

[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550–557, IEEE, Vienna, Austria, September 2009.

[17] P. R. Rinaldi, E. A. Dari, M. J. Venere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163–171, 2012.

[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628–3638, 2011.

[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53–78, 2012.
