
High-Order Spectral Difference: Verification and Acceleration using GPU Computing

Ben J. Zimmerman∗

Department of Aerospace Engineering, Iowa State University, Ames, IA 50011

Z. J. Wang†

Department of Aerospace Engineering, University of Kansas, Lawrence, KS 66045

Miguel R. Visbal‡

Air Force Research Laboratory, WPAFB, Dayton, OH 45433

A high-order spectral difference (SD) method has been developed with graphics processing units (GPUs) using the compute unified device architecture (CUDA). It solves the three-dimensional Navier-Stokes equations on unstructured hexahedral grids with Runge-Kutta time integration. The method is efficient since operations are completed in a one-dimensional fashion and the equations are solved in differential form, removing explicit surface and volume integral calculations. Additionally, solution and flux reconstructions are completed locally per cell, increasing the parallelism of the implementation. Due to this efficiency, the application of GPU computing is appealing. This paper presents the SD method implementation with GPU CUDA computing, presents accuracy studies with isotropic vortex propagation and Couette flow, verifies the high-order accuracy of the solver with a numerically sensitive aero-acoustic problem, and compares the developed solver with a high-order finite difference solver on a case from the 1st International Workshop on High-Order CFD Methods. Finally, the GPU solver is compared to a similar central processing unit (CPU) solver, where speed-ups ranging from 20 to 40 times are demonstrated.

Nomenclature

Cp         Coefficient of pressure
E          Total non-dimensional energy
F, G, H    Vector of fluxes
F̃, G̃, H̃    Vector of transformed fluxes
γ          Ratio of specific heats
h          Lagrange interpolation polynomial on solution points
i, j, k    Index of coordinates in the x, y, z directions
l          Lagrange interpolation polynomial on flux points
J          Jacobian matrix
μ          Dynamic viscosity
nc         Total number of cells in the domain
nsp, nfp   Number of solution points and flux points in one dimension
p          Non-dimensional pressure
Pr         Prandtl number
Q, Q̃       Vector of conservation variables in Cartesian coordinates and in standard unstructured elements
ρ          Non-dimensional density
T          Non-dimensional temperature
u, v, w    Non-dimensional velocity in the x, y, z directions
x, y, z    Non-dimensional Cartesian coordinates
ξ, η, ζ    Non-dimensional coordinates in the standard cubic element
Xs         Location of points in the standard element

* Masters Research Assistant of Aerospace Engineering, 2362 Howe Hall, [email protected], AIAA Member.
† Spahr Professor and Chair of Aerospace Engineering, 2120 Learned Hall, [email protected], Associate Fellow of AIAA.
‡ Aerospace Engineer, Computational Sciences Branch AFRL, [email protected], Fellow of AIAA.

21st AIAA Computational Fluid Dynamics Conference, June 24-27, 2013, San Diego, CA. AIAA 2013-2941. Copyright © 2013 by Ben Zimmerman, Z. J. Wang, and Miguel Visbal. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.

I. Introduction

Computational fluid dynamics (CFD) has long been a useful tool to model flow fields in many engineering disciplines. However, as problems continue to grow in complexity, the simulation time increases drastically. Computations utilizing dozens to hundreds of central processing units (CPUs) require an exceptional amount of computation time for simulations with a high number of degrees of freedom (DoF). This long simulation time introduces a large computational cost, which cannot be ignored. Recently, graphics processing units (GPUs) have been introduced to solve these problems faster. Whereas a CPU contains only a few cores, a single GPU contains hundreds of cores, allowing vast parallelism. To take advantage of this feature, NVIDIA released CUDA (compute unified device architecture) in 2006 [8], allowing the scientific community to apply NVIDIA GPUs to large, complex problems and reducing the computational cost of these numerical simulations.

Simulation of complex and numerically sensitive problems in CFD tends to increase the computational time required to generate an appropriate solution. The numerical simulation of low-Reynolds-number flow over an SD7003 airfoil at a low angle of attack [17] is one such example. The flow over the airfoil detaches from the surface, transitions to turbulence, and reattaches at some later point. Due to the large number of DoF required for the numerical simulation, the computational time is substantial: an acceptable solution using a 3rd-order spectral difference (SD) solver requires approximately 3,500 hours (roughly 145 days) running on 32 CPU cores. GPU computing has already demonstrated considerable speed-ups in the aerospace sciences [2, 5, 12], and applying GPUs to the previously described flow would enable solution generation in a fraction of the time, while utilizing only a few GPUs instead of dozens of CPUs.

When choosing a method to implement with GPU CUDA computing, one must consider the cost and complexity of the method. Methods involving surface or volume integrals are more expensive than methods in differential form, which is particularly true for problems with high-order curved boundaries. As an example, the well-known finite volume (FV) method [1, 3] not only contains volume integrals to evaluate numerically, but also a solution reconstruction which is not local. Reconstruction requires data from neighboring cells in the domain, limiting the applicability of GPU CUDA computing to the method. A method whose reconstruction is completed locally, per cell, is efficient to implement and allows easy parallelism. The SD method reconstructs and updates the solution locally through the use of solution points located within cells. In addition, if hexahedral cells are employed, the efficiency is increased further, as all operations are completed in a one-dimensional manner. Hence, the combination of hexahedral cells and the SD method is chosen for implementation with GPU CUDA computing.

This paper is organized in the following manner. In Section 2, the three-dimensional SD method is reviewed. Section 3 covers the CUDA implementation of the method, and Section 4 presents numerical results using the newly developed SD CUDA code. Timings are then presented for both the CPU and GPU CUDA SD codes for several cases in Section 5. Finally, conclusions from the study are summarized in Section 6.

II. Review of the Spectral Difference Method

A. Governing Equations

Consider the following 3-D Navier-Stokes equations written in conservation form,

\[
\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} + \frac{\partial H}{\partial z} = 0. \tag{1}
\]

The fluxes are written as $F = F^i - F^v$, $G = G^i - G^v$, and $H = H^i - H^v$, where the superscript $i$ denotes the inviscid flux vector and $v$ denotes the viscous flux vector. The conserved variables and inviscid flux vectors are,


\[
Q = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ E \end{pmatrix}, \quad
F^i = \begin{pmatrix} \rho u \\ p + \rho u^2 \\ \rho u v \\ \rho u w \\ u(E + p) \end{pmatrix}, \quad
G^i = \begin{pmatrix} \rho v \\ \rho u v \\ p + \rho v^2 \\ \rho v w \\ v(E + p) \end{pmatrix}, \quad
H^i = \begin{pmatrix} \rho w \\ \rho u w \\ \rho v w \\ p + \rho w^2 \\ w(E + p) \end{pmatrix}, \tag{2}
\]

while the viscous flux vectors are,

\[
F^v = \begin{pmatrix} 0 \\ \tau_{xx} \\ \tau_{xy} \\ \tau_{xz} \\ u\tau_{xx} + v\tau_{xy} + w\tau_{xz} + \dfrac{\mu C_p}{Pr} T_x \end{pmatrix}, \quad
G^v = \begin{pmatrix} 0 \\ \tau_{xy} \\ \tau_{yy} \\ \tau_{yz} \\ u\tau_{xy} + v\tau_{yy} + w\tau_{yz} + \dfrac{\mu C_p}{Pr} T_y \end{pmatrix}, \quad
H^v = \begin{pmatrix} 0 \\ \tau_{xz} \\ \tau_{yz} \\ \tau_{zz} \\ u\tau_{xz} + v\tau_{yz} + w\tau_{zz} + \dfrac{\mu C_p}{Pr} T_z \end{pmatrix}. \tag{3}
\]

Then, the total energy is,

\[
E = \frac{p}{\gamma - 1} + \frac{1}{2}\rho\left(u^2 + v^2 + w^2\right). \tag{4}
\]

The stress tensors take the form,

\[
\begin{aligned}
\tau_{xx} &= 2\mu\left(u_x - \frac{u_x + v_y + w_z}{3}\right), &
\tau_{yy} &= 2\mu\left(v_y - \frac{u_x + v_y + w_z}{3}\right), \\
\tau_{zz} &= 2\mu\left(w_z - \frac{u_x + v_y + w_z}{3}\right), &
\tau_{xy} &= \mu(v_x + u_y) = \tau_{yx}, \\
\tau_{yz} &= \mu(w_y + v_z) = \tau_{zy}, &
\tau_{xz} &= \mu(u_z + w_x) = \tau_{zx},
\end{aligned} \tag{5}
\]

where the subscripts $(x, y, z)$ denote a partial derivative with respect to that variable (e.g., $u_x = \partial u/\partial x$).

B. Coordinate Transformation

The computational domain is filled with non-overlapping hexahedral elements. Both linear and quadratic isoparametric elements are employed, with linear elements used throughout the interior of the domain and quadratic elements near the high-order curved boundaries. The elements are transformed from the physical coordinate system $(x, y, z)$ to a standard cubic element $(\xi, \eta, \zeta) \in [0,1] \times [0,1] \times [0,1]$, as shown in figure 1. The transformation takes the following form,

\[
\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \sum_{i=1}^{N} M_i(\xi, \eta, \zeta) \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}, \tag{6}
\]


Figure 1. Transformation from physical to standard element.

where $N$ indicates the number of points defining the physical element and $M_i(\xi, \eta, \zeta)$ represents the shape functions. The Jacobian matrix becomes,

\[
J = \frac{\partial(x, y, z)}{\partial(\xi, \eta, \zeta)} =
\begin{pmatrix} x_\xi & x_\eta & x_\zeta \\ y_\xi & y_\eta & y_\zeta \\ z_\xi & z_\eta & z_\zeta \end{pmatrix}. \tag{7}
\]
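For the linear (eight-node) hexahedral elements, the shape functions $M_i$ in equation (6) are presumably the standard trilinear functions (the quadratic boundary elements use the analogous higher-order set). With vertex coordinates $(\xi_i, \eta_i, \zeta_i) \in \{0,1\}^3$, these can be written as

\[
M_i(\xi, \eta, \zeta) = \big[\xi_i \xi + (1 - \xi_i)(1 - \xi)\big]\,
                        \big[\eta_i \eta + (1 - \eta_i)(1 - \eta)\big]\,
                        \big[\zeta_i \zeta + (1 - \zeta_i)(1 - \zeta)\big],
\]

so that $M_i$ equals one at vertex $i$ and zero at the other seven vertices.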

If the Jacobian of the transformation is non-singular, its inverse transformation must also exist. Inverting the Jacobian matrix yields,

\[
J^{-1} = \frac{\partial(\xi, \eta, \zeta)}{\partial(x, y, z)} =
\begin{pmatrix} \xi_x & \xi_y & \xi_z \\ \eta_x & \eta_y & \eta_z \\ \zeta_x & \zeta_y & \zeta_z \end{pmatrix}, \tag{8}
\]

then the metrics are formulated,

\[
\begin{aligned}
\xi_x &= \frac{y_\eta z_\zeta - y_\zeta z_\eta}{|J|}, &
\xi_y &= \frac{x_\zeta z_\eta - x_\eta z_\zeta}{|J|}, &
\xi_z &= \frac{x_\eta y_\zeta - x_\zeta y_\eta}{|J|}, \\
\eta_x &= \frac{y_\zeta z_\xi - y_\xi z_\zeta}{|J|}, &
\eta_y &= \frac{x_\xi z_\zeta - x_\zeta z_\xi}{|J|}, &
\eta_z &= \frac{x_\zeta y_\xi - x_\xi y_\zeta}{|J|}, \\
\zeta_x &= \frac{y_\xi z_\eta - y_\eta z_\xi}{|J|}, &
\zeta_y &= \frac{x_\eta z_\xi - x_\xi z_\eta}{|J|}, &
\zeta_z &= \frac{x_\xi y_\eta - x_\eta y_\xi}{|J|}.
\end{aligned} \tag{9}
\]

The governing Navier-Stokes equations are transformed from the physical domain to the computational domain using the above transformations. The end result is,

\[
\frac{\partial \tilde{Q}}{\partial t} + \frac{\partial \tilde{F}}{\partial \xi} + \frac{\partial \tilde{G}}{\partial \eta} + \frac{\partial \tilde{H}}{\partial \zeta} = 0, \tag{10}
\]

where the transformed variables are,

\[
\tilde{Q} = |J| \cdot Q, \qquad \tilde{F} = \tilde{F}^i - \tilde{F}^v, \quad \tilde{G} = \tilde{G}^i - \tilde{G}^v, \quad \tilde{H} = \tilde{H}^i - \tilde{H}^v, \tag{11}
\]

\[
\begin{pmatrix} \tilde{F}^i \\ \tilde{G}^i \\ \tilde{H}^i \end{pmatrix}
= |J| \begin{pmatrix} \xi_x & \xi_y & \xi_z \\ \eta_x & \eta_y & \eta_z \\ \zeta_x & \zeta_y & \zeta_z \end{pmatrix}
\cdot \begin{pmatrix} F^i \\ G^i \\ H^i \end{pmatrix}, \tag{12}
\]

\[
\begin{pmatrix} \tilde{F}^v \\ \tilde{G}^v \\ \tilde{H}^v \end{pmatrix}
= |J| \begin{pmatrix} \xi_x & \xi_y & \xi_z \\ \eta_x & \eta_y & \eta_z \\ \zeta_x & \zeta_y & \zeta_z \end{pmatrix}
\cdot \begin{pmatrix} F^v \\ G^v \\ H^v \end{pmatrix}. \tag{13}
\]


Figure 2. Distribution of solution (circles) and flux (squares) points in a standard element for the third-order SD scheme.

C. Space Discretization

In the standard element, two sets of points are defined to reside inside the element, namely solution points and flux points. The solution unknowns are the conserved variables, which are stored at the solution points, while the flux values are stored at the flux points. Figure 2 illustrates the distribution of solution and flux points in a standard element. In order to construct a degree $(n-1)$ polynomial, $n$ solution points and $(n+1)$ flux points are required. The solution points are the Gauss points given by,

\[
X_s = \frac{1}{2}\left[1 - \cos\left(\frac{2s - 1}{2n}\,\pi\right)\right], \quad s = 1, 2, \ldots, n, \tag{14}
\]

and the flux points are the Gauss-Lobatto points given by,

\[
X_{s+1/2} = \frac{1}{2}\left[1 - \cos\left(\frac{s}{n}\,\pi\right)\right], \quad s = 0, 1, \ldots, n. \tag{15}
\]

To reconstruct the solution, the $n$ solution points are used to build a degree $(n-1)$ polynomial with the Lagrange basis,

\[
h_i(X) = \prod_{s=1,\, s \neq i}^{n} \left(\frac{X - X_s}{X_i - X_s}\right). \tag{16}
\]

Similarly, the $(n+1)$ flux points are used to build a degree $n$ polynomial for the flux with,

\[
l_{i+1/2}(X) = \prod_{s=0,\, s \neq i}^{n} \left(\frac{X - X_{s+1/2}}{X_{i+1/2} - X_{s+1/2}}\right). \tag{17}
\]
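As a concrete illustration of equations (14)-(17), the following standalone C++ sketch (not the paper's solver code; the helper names are invented for illustration) computes the solution points, the flux points, and the derivative coefficients $l'_{m+1/2}(X_i)$ that appear later in equation (20):

#include <cmath>
#include <cstdio>
#include <vector>

// Solution points (Gauss, eq. 14) and flux points (Gauss-Lobatto, eq. 15) on [0,1].
std::vector<double> solutionPoints(int n) {
    const double PI = std::acos(-1.0);
    std::vector<double> xs(n);
    for (int s = 1; s <= n; ++s)
        xs[s - 1] = 0.5 * (1.0 - std::cos((2.0 * s - 1.0) / (2.0 * n) * PI));
    return xs;
}
std::vector<double> fluxPoints(int n) {
    const double PI = std::acos(-1.0);
    std::vector<double> xf(n + 1);
    for (int s = 0; s <= n; ++s)
        xf[s] = 0.5 * (1.0 - std::cos(s * PI / n));
    return xf;
}

// Derivative of the flux-point Lagrange polynomial l_{m+1/2} (eq. 17) evaluated at x.
double dLagrangeFlux(const std::vector<double>& xf, int m, double x) {
    double sum = 0.0;
    for (int q = 0; q < (int)xf.size(); ++q) {
        if (q == m) continue;
        double prod = 1.0 / (xf[m] - xf[q]);
        for (int s = 0; s < (int)xf.size(); ++s)
            if (s != m && s != q) prod *= (x - xf[s]) / (xf[m] - xf[s]);
        sum += prod;
    }
    return sum;
}

int main() {
    int n = 3;                                        // third-order scheme
    std::vector<double> xs = solutionPoints(n), xf = fluxPoints(n);
    for (int i = 0; i < n; ++i)                       // table of l'_{m+1/2}(X_i)
        for (int m = 0; m <= n; ++m)
            std::printf("l'[%d](X_%d) = %8.4f\n", m, i, dLagrangeFlux(xf, m, xs[i]));
    return 0;
}

For $n = 3$ these formulas give solution points {0.0670, 0.5, 0.9330} and flux points {0, 0.25, 0.75, 1}.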

Reconstruction of the solution is the tensor product of the one-dimensional polynomials built with the Lagrange basis defined in equation (16),

\[
Q(\xi, \eta, \zeta) = \sum_{k=1}^{n} \sum_{j=1}^{n} \sum_{i=1}^{n} \frac{\tilde{Q}_{i,j,k}}{|J_{i,j,k}|}\, h_i(\xi) \cdot h_j(\eta) \cdot h_k(\zeta). \tag{18}
\]

Likewise, the reconstructed flux polynomials become,

\[
\begin{aligned}
\tilde{F}(\xi, \eta, \zeta) &= \sum_{k=1}^{n} \sum_{j=1}^{n} \sum_{i=0}^{n} \tilde{F}_{i+1/2,j,k}\, l_{i+1/2}(\xi) \cdot h_j(\eta) \cdot h_k(\zeta), \\
\tilde{G}(\xi, \eta, \zeta) &= \sum_{k=1}^{n} \sum_{j=0}^{n} \sum_{i=1}^{n} \tilde{G}_{i,j+1/2,k}\, h_i(\xi) \cdot l_{j+1/2}(\eta) \cdot h_k(\zeta), \\
\tilde{H}(\xi, \eta, \zeta) &= \sum_{k=0}^{n} \sum_{j=1}^{n} \sum_{i=1}^{n} \tilde{H}_{i,j,k+1/2}\, h_i(\xi) \cdot h_j(\eta) \cdot l_{k+1/2}(\zeta).
\end{aligned} \tag{19}
\]


The reconstructed fluxes are only element-wise continuous and are discontinuous across cell interfaces. Hence, a Riemann solver, such as the Rusanov [10] or Roe [9] flux, is used to compute the common inviscid flux at the interfaces. This ensures both conservation and stability. The inviscid flux derivative calculations are completed in the following manner. First, the conserved variables at the solution points are interpolated to the flux points using the basis of equation (16). Then the interior flux points are evaluated using the values of the conserved variables at the flux points, while the flux at cell interfaces is calculated with a Rusanov or Roe flux to provide element coupling. Finally, the derivatives of the fluxes at the solution points are found with,

\[
\begin{aligned}
\left(\frac{\partial \tilde{F}}{\partial \xi}\right)_{i,j,k} &= \sum_{m=0}^{n} \tilde{F}_{m+1/2,j,k} \cdot l'_{m+1/2}(\xi_i), \\
\left(\frac{\partial \tilde{G}}{\partial \eta}\right)_{i,j,k} &= \sum_{m=0}^{n} \tilde{G}_{i,m+1/2,k} \cdot l'_{m+1/2}(\eta_j), \\
\left(\frac{\partial \tilde{H}}{\partial \zeta}\right)_{i,j,k} &= \sum_{m=0}^{n} \tilde{H}_{i,j,m+1/2} \cdot l'_{m+1/2}(\zeta_k),
\end{aligned} \tag{20}
\]

where $l'$ is the spatial derivative of the Lagrange polynomial. The solution is then updated with equation (26).
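The paper does not write out the common interface flux. For illustration only, a minimal sketch of the Rusanov (local Lax-Friedrichs) flux for the one-dimensional Euler equations is given below; the struct and function names are hypothetical, and the actual solver may instead use the Roe flux or a different data layout.

#include <algorithm>
#include <cmath>

// Rusanov common flux: F* = 0.5 (F(QL) + F(QR)) - 0.5 * lambda_max * (QR - QL),
// where lambda_max is the largest wave speed |u| + c of the two states.
// Q = {rho, rho*u, E}; gam is the ratio of specific heats.
struct State { double rho, rhou, E; };

static void eulerFlux(const State& q, double gam, double f[3])
{
    double u = q.rhou / q.rho;
    double p = (gam - 1.0) * (q.E - 0.5 * q.rho * u * u);
    f[0] = q.rhou;
    f[1] = p + q.rho * u * u;
    f[2] = u * (q.E + p);
}

void rusanovFlux(const State& qL, const State& qR, double gam, double fCommon[3])
{
    double fL[3], fR[3];
    eulerFlux(qL, gam, fL);
    eulerFlux(qR, gam, fR);

    // Maximum wave speed of the left and right states.
    double uL = qL.rhou / qL.rho, uR = qR.rhou / qR.rho;
    double pL = (gam - 1.0) * (qL.E - 0.5 * qL.rho * uL * uL);
    double pR = (gam - 1.0) * (qR.E - 0.5 * qR.rho * uR * uR);
    double lam = std::max(std::fabs(uL) + std::sqrt(gam * pL / qL.rho),
                          std::fabs(uR) + std::sqrt(gam * pR / qR.rho));

    double dq[3] = { qR.rho - qL.rho, qR.rhou - qL.rhou, qR.E - qL.E };
    for (int m = 0; m < 3; ++m)
        fCommon[m] = 0.5 * (fL[m] + fR[m]) - 0.5 * lam * dq[m];
}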

To evaluate the viscous flux, the conserved variables and their gradients are required. An averaging approach [6] is employed and described below. In the physical domain, the gradients are computed with,

\[
\nabla Q = \frac{\partial Q}{\partial \xi}\nabla \xi + \frac{\partial Q}{\partial \eta}\nabla \eta + \frac{\partial Q}{\partial \zeta}\nabla \zeta. \tag{21}
\]

Let $\vec{S}_\xi = |J|(\xi_x, \xi_y, \xi_z)$, $\vec{S}_\eta = |J|(\eta_x, \eta_y, \eta_z)$, and $\vec{S}_\zeta = |J|(\zeta_x, \zeta_y, \zeta_z)$. Then $\tilde{F} = \vec{f} \cdot \vec{S}_\xi$, $\tilde{G} = \vec{f} \cdot \vec{S}_\eta$, and $\tilde{H} = \vec{f} \cdot \vec{S}_\zeta$ with $\vec{f} = (F, G, H)$. The following identity is used,

\[
\frac{\partial \vec{S}_\xi}{\partial \xi} + \frac{\partial \vec{S}_\eta}{\partial \eta} + \frac{\partial \vec{S}_\zeta}{\partial \zeta} = 0, \tag{22}
\]

and the gradient of the conserved variables from equation (21) becomes,

\[
\nabla Q = \frac{1}{|J|}\left[\frac{\partial \big(Q\vec{S}_\xi\big)}{\partial \xi} + \frac{\partial \big(Q\vec{S}_\eta\big)}{\partial \eta} + \frac{\partial \big(Q\vec{S}_\zeta\big)}{\partial \zeta}\right]. \tag{23}
\]

The derivatives along each coordinate direction are then computed with,

\[
\begin{aligned}
\left(\frac{\partial \big(Q\vec{S}_\xi\big)}{\partial \xi}\right)_{j,k} &= \sum_{m=0}^{n} \big(Q\vec{S}_\xi\big)_{m+1/2,j,k} \cdot l'_{m+1/2}(\xi), \\
\left(\frac{\partial \big(Q\vec{S}_\eta\big)}{\partial \eta}\right)_{i,k} &= \sum_{m=0}^{n} \big(Q\vec{S}_\eta\big)_{i,m+1/2,k} \cdot l'_{m+1/2}(\eta), \\
\left(\frac{\partial \big(Q\vec{S}_\zeta\big)}{\partial \zeta}\right)_{i,j} &= \sum_{m=0}^{n} \big(Q\vec{S}_\zeta\big)_{i,j,m+1/2} \cdot l'_{m+1/2}(\zeta).
\end{aligned} \tag{24}
\]

The steps to evaluate the viscous fluxes are given here. First, the conserved variables are computed at the flux points and the average of the solutions at cell interfaces is found with,

\[
Q = \frac{Q^L + Q^R}{2}. \tag{25}
\]

Then the gradients of the solution at the flux points are computed from the solutions found at the flux points. This can be completed in two ways. One can find the gradients at the solution points with equations (23) and (24) using the derivative coefficients in equation (19), and then interpolate the gradients to the flux points with Lagrange interpolation; or the gradients can be computed directly at the flux points by evaluating the derivative coefficients at those points. Next, the average gradients at the cell-interface flux points are required and are computed in the same way as in equation (25). Finally, the viscous fluxes are computed at the flux points and their derivatives are found at the solution points with equation (20). The solution can then be updated at the solution points with,

\[
\frac{\partial \tilde{Q}_{i,j,k}}{\partial t} = -\left(\frac{\partial \tilde{F}}{\partial \xi} + \frac{\partial \tilde{G}}{\partial \eta} + \frac{\partial \tilde{H}}{\partial \zeta}\right)_{i,j,k}. \tag{26}
\]

All time integration is completed with a third order Runge-Kutta explicit time stepping scheme.
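The Runge-Kutta coefficients are not listed in the paper; a common three-stage, third-order choice (an assumption here, not confirmed by the authors) is the SSP scheme of Shu and Osher, where $R(\tilde{Q})$ denotes the right-hand side of equation (26):

\[
\begin{aligned}
\tilde{Q}^{(1)} &= \tilde{Q}^{\,n} + \Delta t\, R\big(\tilde{Q}^{\,n}\big), \\
\tilde{Q}^{(2)} &= \tfrac{3}{4}\tilde{Q}^{\,n} + \tfrac{1}{4}\Big[\tilde{Q}^{(1)} + \Delta t\, R\big(\tilde{Q}^{(1)}\big)\Big], \\
\tilde{Q}^{\,n+1} &= \tfrac{1}{3}\tilde{Q}^{\,n} + \tfrac{2}{3}\Big[\tilde{Q}^{(2)} + \Delta t\, R\big(\tilde{Q}^{(2)}\big)\Big].
\end{aligned}
\]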

III. CUDA Implementation of Spectral Difference

A. Brief Introduction of CUDA

GPUs were previously used only for calculating the images shown on a computer screen. Developments by NVIDIA have enabled GPUs to tackle more general problems with CUDA, completing calculations at much higher speeds than their CPU counterparts. In addition, the hardware and capabilities of GPUs continue to improve, enabling the devices to handle larger and more complicated problems. The backbone of CUDA lies within its architecture, which is detailed here. A GPU's streaming multiprocessor (SM) count dictates the number of tasks it can complete in parallel. Newer GPUs contain more SMs than previous-generation cards, allowing faster execution. These tasks are known as blocks, and when a GPU function (or kernel) is launched, the GPU forms a grid composed of blocks, which in turn contain a number of threads. The dimension of the grid can be either one- or two-dimensional, and the dimension of a block can be one-, two-, or three-dimensional. Block indexing is controlled by the CUDA variable blockIdx.x for a one-dimensional grid, with the additional blockIdx.y for a two-dimensional grid. Similarly, the threads are indexed by threadIdx.x, threadIdx.y, and threadIdx.z (depending on the block dimension).
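As a minimal, hypothetical illustration of this grid/block/thread mapping (not part of the paper's solver; the kernel and array names are invented), the following program launches one block per cell and one thread per solution point:

#include <cuda_runtime.h>
#include <cstdio>

// One block per cell, one thread per solution point.
__global__ void scaleResidual(double* res, double dt, int pointsPerCell)
{
    int cell = blockIdx.x;                                    // block index = cell index
    int pt   = threadIdx.x + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;         // local solution-point index
    res[cell * pointsPerCell + pt] *= dt;                     // per-point update
}

int main()
{
    const int nc = 10, nsp = 3, np = nsp * nsp * nsp;         // 10 cells, third-order SD
    double* d_res;
    cudaMalloc(&d_res, nc * np * sizeof(double));
    cudaMemset(d_res, 0, nc * np * sizeof(double));
    dim3 block(nsp, nsp, nsp);                                // 27 threads per block
    scaleResidual<<<nc, block>>>(d_res, 1.0e-3, np);          // one-dimensional grid of 10 blocks
    cudaDeviceSynchronize();
    cudaFree(d_res);
    std::printf("done\n");
    return 0;
}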

In addition to grids, blocks, and threads, multiple memory types exist in GPU computing. This paper considers the global, texture, local, and shared memory types. Memory copied directly to the GPU is stored in global memory. All blocks and threads can access this memory, but it requires coalesced access [8] for best performance, and performing computations in this memory is not ideal. Global memory is bound to texture memory, and once a global memory location is updated, its corresponding texture memory is updated. Texture memory allows fast cached access by all blocks and threads, so it is preferable to read all needed data from texture memory into a thread's local memory or into shared memory. Calculations within a GPU kernel should be completed in local or shared memory to achieve the best performance. Local memory is local to the thread and requires coalesced access, but calculations are quick. The final memory type, shared memory, allows threads within a block to share data or to re-order data if necessary. More information regarding CUDA, memory, and optimizations can be found in the NVIDIA CUDA C Programming Guide and the CUDA C Best Practices Guide [7, 8].

B. CUDA Implementation

The cells throughout the domain are hexahedral elements, hence all the operations are carried out in a one-dimensional manner. Consider the calculation of the flux derivative, $\partial \tilde{F}/\partial \xi$, with a CPU C++ code as shown in algorithm 1. For one cell in the domain, the flux derivative is updated at each solution point, and its value at each solution point requires data from every flux point in that direction. The algorithm must be repeated to compute the flux derivative for all cells.

Now consider the GPU code in algorithm 2, which is slightly different from the CPU code in algorithm 1. In the GPU code, each cell is a block, meaning the dimension of the grid is equal to the total number of cells in the domain. The threads per block are set to the solution points in a cell. For example, in a third-order SD scheme with ten cells in the domain, nsp = 3, nfp = 4, and nc = 10. With these dimensions, there are ten blocks (blockIdx.x = 0, 1, 2, ..., 9), with 27 threads per block (threadIdx.x = 0, 1, 2, threadIdx.y = 0, 1, 2, and threadIdx.z = 0, 1, 2); CUDA dimensions start at index 0. Each thread in each block independently calculates the flux derivative at its corresponding solution point in the domain. Of course, using only 27 threads per block wastes GPU resources, but the code presented is merely for understanding purposes. In the developed solver, multiple cells are calculated per block, maximizing the number of threads available per block.


Algorithm 1 CPU calculation of the flux derivative for one cell

for k = 0 to nsp do
  for j = 0 to nsp do
    for i = 0 to nsp do
      ▷ Initialize the array
      ∂F/∂x[i][j][k] = 0
      for m = 0 to nfp do
        ▷ Compute the derivative
        ∂F/∂x[i][j][k] = ∂F/∂x[i][j][k] + F[m][j][k] * l[i][m]
      end for
    end for
  end for
end for

Algorithm 2 GPU calculation of the flux derivative (not optimized)

i = threadIdx.x
j = threadIdx.y
k = threadIdx.z
cell = blockIdx.x
▷ Initialize the array
id1 = i + j*nsp + k*nsp*nsp
∂F/∂x_l[id1] = 0
for m = 0 to nfp do
  ▷ Compute the derivative in local memory
  id2 = m + j*nfp + k*nfp*nsp + cell*nfp*nsp*nsp
  ∂F/∂x_l[id1] = ∂F/∂x_l[id1] + F_g[id2] * l_g[m + i*nfp]
end for

Additionally, algorithm 2 pays no attention to memory access: everything is calculated in global memory (hence the subscript g) and stored into the local memory of each thread (the subscript l). To optimize the algorithm, specific GPU memory types are exploited in the calculation, as shown in algorithm 3.

Comparing both algorithms, the first major difference is that the threads are set to the flux points in algorithm 3 instead of the solution points as in algorithm 2. Hence, for a third-order scheme, threadIdx.x = 0, 1, 2, 3, threadIdx.y = 0, 1, 2, 3, and threadIdx.z = 0, 1, 2, 3. This enables the threads to read the appropriate data from texture memory into shared memory. Shared memory is needed to accelerate the calculation of the flux derivative, which requires the loop over the flux points. Once the shared memory is loaded, the threads must be synchronized to ensure all data have been loaded; this is completed with the syncthreads command. Next, the threads switch from operating on flux points to solution points via the if-statements. The flux derivative is then computed as in algorithm 2, but the computations are completed in shared memory and stored in local memory.

Between the two GPU algorithms, the amount of code has more than doubled in the optimized version. However, given a large enough domain, the GPU code in algorithm 3 out-performs the GPU code in algorithm 2 by more than two times, resulting in a much faster calculation. For the three-dimensional SD method, we again let each block be a cell, and in some cases allow multiple cells to be calculated per block. For the threads, we let each flux point in each direction be a thread, so that the code can freely switch to solution points in any direction. Thus, the grid is still one-dimensional as in the examples presented, but the blocks are three-dimensional, with threads reconstructing the solution in all three directions per cell. This allows for large performance increases when compared to the CPU SD code, which will be discussed later in Section 5.


Algorithm 3 GPU calculation of the flux derivative (optimized)

i = threadIdx.x
j = threadIdx.y
k = threadIdx.z
cell = blockIdx.x
▷ Allocate shared memory
shared double l_s[nfp*nsp]
shared double F_s[nsp*nsp*nfp]
▷ Load shared memory from texture memory
if j < nsp then
  l_s[i + j*nfp] = l_t[i + j*nfp]
end if
if k < nsp then
  if j < nsp then
    id = i + j*nfp + k*nfp*nsp
    F_s[id] = F_t[id + cell*nsp*nsp*nfp]
  end if
end if
syncthreads
if k < nsp then
  if j < nsp then
    if i < nsp then
      ▷ Initialize the array
      id1 = i + j*nsp + k*nsp*nsp
      ∂F/∂x_l[id1] = 0
      for m = 0 to nfp do
        ▷ Compute the derivative in local memory
        id2 = m + j*nfp + k*nfp*nsp
        ∂F/∂x_l[id1] = ∂F/∂x_l[id1] + F_s[id2] * l_s[m + i*nfp]
      end for
    end if
  end if
end if
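For readers who prefer compilable code, the following CUDA C++ sketch follows the structure of algorithms 2 and 3 for the ξ-direction flux derivative of a third-order scheme. The array names, the use of plain global-memory loads in place of texture fetches, and the data layout are assumptions made here for illustration, not the paper's implementation.

// Hypothetical sketch: flux derivative along xi for a 3rd-order SD scheme
// (NSP = 3 solution points, NFP = 4 flux points per direction). One block per
// cell; threads cover the flux points so they can stage data in shared memory,
// then the first NSP^3 threads accumulate the solution-point derivatives.
#define NSP 3
#define NFP 4

__global__ void fluxDerivXi(const double* __restrict__ F,   // fluxes at flux points
                            const double* __restrict__ lp,  // l'_{m+1/2}(xi_i), NSP x NFP
                            double* __restrict__ dFdxi)     // output at solution points
{
    int i = threadIdx.x, j = threadIdx.y, k = threadIdx.z;
    int cell = blockIdx.x;

    __shared__ double ls[NSP * NFP];
    __shared__ double Fs[NFP * NSP * NSP];

    // Stage the derivative coefficients and this cell's fluxes in shared memory.
    if (k == 0 && j < NSP) ls[j * NFP + i] = lp[j * NFP + i];
    if (j < NSP && k < NSP) {
        int id = i + j * NFP + k * NFP * NSP;
        Fs[id] = F[cell * NFP * NSP * NSP + id];
    }
    __syncthreads();

    // Each of the first NSP^3 threads accumulates one solution-point derivative.
    if (i < NSP && j < NSP && k < NSP) {
        double sum = 0.0;
        for (int m = 0; m < NFP; ++m)
            sum += Fs[m + j * NFP + k * NFP * NSP] * ls[i * NFP + m];
        dFdxi[cell * NSP * NSP * NSP + i + j * NSP + k * NSP * NSP] = sum;
    }
}

// Launch, following algorithm 3's thread layout:
//   dim3 block(NFP, NFP, NFP);
//   fluxDerivXi<<<numCells, block>>>(d_F, d_lp, d_dFdxi);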


IV. CUDA Verification

A. Accuracy study with isotropic vortex propagation

To verify the Euler-equation portion of the CUDA SD code, both grid refinement (h-refinement) and order refinement (p-refinement) were studied with the propagating isotropic vortex problem. This problem has an analytical solution and was used by Shu [11]. The mean flow is given by (ρ, u, v, w, p) = (1, 1, 0, 0, 1). An isotropic vortex is then added to the flow with perturbations in the u and v velocities and the temperature given by,

\[
(\delta u, \delta v, \delta w) = \frac{\varepsilon}{2\pi}\, e^{0.5(1 - r^2)}\, (-y,\, x,\, 0), \qquad
\delta T = -\frac{(\gamma - 1)\,\varepsilon^2}{8\gamma \pi^2}\, e^{1 - r^2}. \tag{27}
\]

No perturbations are added to the w velocity or the entropy. In equation (27), $r^2 = x^2 + y^2$ and the vortex strength is $\varepsilon = 5$. The computational domain is [-5,5] x [-5,5] x [-5,5]. In the x and y-directions, characteristic inflow and outflow are chosen as the boundary conditions, with a symmetric boundary condition in the z-direction. The above initial conditions yield an exact solution of the Euler equations which convects with speed (1, 0, 0) in the x-direction.

The h-refinement study was completed on four different mesh sizes. The order of accuracy, n, was set to four. Figure 3(a) shows the time-independent errors between the numerical solution and the analytical solution in the L∞, L1, and L2 norms. All norms were calculated at time t = 1.0, and a three-stage Runge-Kutta scheme was used for time integration.
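The discrete norm definitions are not stated in the paper; the standard choices, assuming $N$ degrees of freedom with pointwise error $e_i$, are

\[
L_1 = \frac{1}{N}\sum_{i=1}^{N} |e_i|, \qquad
L_2 = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}, \qquad
L_\infty = \max_i |e_i|.
\]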

The p-refinement study was completed on a coarse grid [10x10x1] (100 cells). The order of the polynomial basis was increased after each completed simulation. Again, all norms were calculated at time t = 1.0. Figure 3(b) shows the numerical errors and illustrates that an exponential decay of the error with respect to the order of accuracy is achieved.

Figure 3. Grid and order refinement for vortex propagation: (a) solution errors with h-refinement; (b) solution errors with p-refinement.

B. Accuracy study with Couette flow

Couette flow is used for the viscous validation of the SD CUDA code. It is an analytical solution to the Navier-Stokes equations and models viscous flow in the positive x-direction. There are two parallel plates, one at y = 0 and another at y = h, with temperatures T0 and Th, respectively. In addition, the plate at y = 0 is fixed while the plate at y = h moves with speed U. If the viscosity coefficient μ is constant, this problem has the following exact solution:


\[
\begin{aligned}
u &= \frac{U y}{h}, \quad v = 0, \quad w = 0, \\
T &= T_0 + \frac{y}{h}\left(T_h - T_0\right) + \frac{\mu y U^2}{2kh}\left(1 - \frac{y}{h}\right), \\
p &= \text{constant}, \quad \rho = \frac{p}{RT}.
\end{aligned} \tag{28}
\]

In equation (28), k is the thermal conductivity and R is the gas constant. For the current computation, U = 1.0, h = 2, T0 = 0.8, Th = 0.85, and μ = 0.01. The computational domain is [0,4] x [0,2] x [0,4]. A p-refinement study was completed on a coarse grid [2x2x1] with 4 cells. The order of the polynomial basis was increased after each completed simulation. Figure 4 shows the numerical errors in the L∞, L1, and L2 norms, and again the exponential decay of these errors is observed. The three-stage Runge-Kutta scheme was used for time integration.

Figure 4. Solution errors with p-refinement for Couette flow.

C. Acoustic pressure pulse

To study the effects of high order on the SD scheme, an aero-acoustic problem is chosen. Consider a pressure pulse started in the center of the domain, given by the following equation,

\[
p = p_\infty + \varepsilon\, e^{-\ln 2\,\left[(x - x_c)^2 + (y - y_c)^2\right]/b^2}. \tag{29}
\]

In equation (29), b = 0.2, ε = 0.1, and xc = yc = 0. The computational domain was taken as [-10,10] x [-10,10] and contained 900 cells. Characteristic outflow conditions were implemented along the boundaries in the x and y-directions, with a symmetric boundary condition in the z-direction. Near the boundaries, the grid was stretched as described by Visbal and Gaitonde [16] so as not to contaminate the genuine solution; a discussion of this grid stretching in one dimension was presented by Vichnevetsky [13]. This problem requires solving the Euler equations only, meaning $F^v = G^v = H^v = 0$. Figure 5(a) shows the pressure contours at time t = 4.5 seconds for a sixth-order SD scheme. At this time, the pressure pulse was still sufficiently far from the computational boundaries, hence the effects of the boundary conditions were diminished. Figure 5(b) shows the pressure along the centerline for the second, third, fourth, and fifth-order SD schemes (the sixth and fifth order were found to give identical results). The results demonstrate that second and even third-order schemes have issues with this simulation. However, fourth and fifth order appear to be converging towards an acceptable solution, with only a slight difference in the results at the peaks just before x = -5 and just after x = 5.

D. Acoustic pulse and cylinder

The following case looks at another pressure pulse as it interacts with a cylinder. Equation (29) is again used to initialize the pulse, with b = 0.2, ε = 0.1, xc = 4, and yc = 0. The computational domain is taken to be [-15,15] x [0,15].


Figure 5. Pressure pulse results: (a) pressure contours; (b) pressure taken at the centerline (y = 0).

The symmetry condition is employed along y = 0, and characteristic outflow conditions with grid stretching are applied along y = 15 and in both x-directions. The domain contains 2074 cells and the case was run at second, third, fourth, and fifth order of accuracy. This problem, like the one before, requires solving only the Euler equations. The cylinder has radius r = 1, and data were recorded at three locations, A, B, and C, all at r = 5 and θ = 90, 135, and 180 degrees, respectively, where the exact solution is known. Figure 6 shows the pressure contours of the case at two different times for fourth order. Figure 7 illustrates the pressure disturbance, p′, histories, where p = p∞ + p′. From the figures it is clear that second-order methods cannot capture the effects completely, and even third order has difficulties. Fourth and fifth order, however, exhibit very good results, matching well with the exact solution at all three points and further demonstrating the accuracy of the SD CUDA code.

Figure 6. Pressure contours for the acoustic cylinder case: (a) t = 3 seconds; (b) t = 7.8 seconds.

E. SD7003 wing

The final simulation compares results from two different solvers: the SD CUDA solver against FDL3DI [14, 15], a high-order finite difference code. The case is from the 1st International Workshop on High-Order CFD Methods and consists of an SD7003 wing at 4 degrees angle of attack with an incoming flow Mach number of 0.1 and a Reynolds number of 60,000. The case was run for 20 convective times, 12 of which were time-averaged.

The results in figure 8(b) show the mean u-velocity streamlines. Data were taken tangent to the airfoil surface at chord positions of 0.1c, 0.2c, 0.3c, ..., and 0.9c. In figure 8(a), the time-averaged pressure coefficient is compared between the two solvers. SD CUDA shows good agreement with FDL3DI in both plots. The slight disagreement can be attributed to the boundary conditions in the span-wise direction, which were symmetric for CUDA SD but periodic (cyclic) for FDL3DI.


Figure 7. Pressure disturbances at several points for the acoustic cylinder case: (a) point A; (b) point B; (c) point C.

V. CUDA Acceleration

An important aspect of GPU programming is the speed-up, or computational speed increase, when compared to similar CPU codes. This section compares the speeds of the SD method on GPUs and CPUs. To compare the two, the number of CPUs is set equal to the number of GPUs: if a CPU contains a total of eight cores, then all eight cores are run and compared to one entire GPU. The CPU code is written in FORTRAN, while the GPU code is in CUDA C++. Both codes are compiled with the appropriate optimization flags to ensure the best performance for comparison. In all cases, the CPU used was an Intel Xeon 2.27 GHz, a quad core with hyper-threading. Different GPUs were employed for the cases, and they are described within each speed-up study.

A. Isotropic vortex

The GPU in this case was a Tesla C1060 with 240 CUDA cores, while the CPU ran with all cores activated. This case demonstrates less than optimal speed-up, due to the cyclic boundary condition required, which is expensive for the GPU implementation.


Figure 8. SD7003 results: (a) mean pressure coefficient of CUDA SD 3rd order (blue), 2nd order (green), and FDL3DI (red); (b) time-averaged span-wise u-velocity of CUDA SD (blue) and FDL3DI (red); (c) time-averaged u-velocity contours.

Type / Order 3rd 4th 5th

CPU 0.011 0.026 0.055

GPU 0.001433 0.002498 0.004461

Table 1. Isotropic vortex timings (seconds per iteration).

The remaining cases show improved performance in the absence of the cyclic boundary condition. Table 1 lists the time spent in seconds per iteration for the two solvers, and figure 9 shows the speed-up results.
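For reference, the entries of table 1 correspond to speed-ups of roughly

\[
\frac{0.011}{0.001433} \approx 7.7, \qquad \frac{0.026}{0.002498} \approx 10.4, \qquad \frac{0.055}{0.004461} \approx 12.3
\]

for the 3rd, 4th, and 5th-order runs, respectively, consistent with the less-than-optimal behavior noted above.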

B. Acoustic pulse and cylinder

Type / Order 2nd 3rd 4th 5th

CPU 0.205342 0.576190 1.233750 2.345520

GTX 0.008277 0.018696 0.041478 0.074551

Tesla 0.008327 0.019119 0.032121 0.066518

Table 2. Acoustic pulse and cylinder timings (seconds per iteration).

The second test case shows very promising results. Two different GPUs were tested here, a GTX 550 Ti and a Tesla C2070, to demonstrate the capability of low and high-end GPUs. The GTX card is a CUDA-enabled gaming graphics card with 192 CUDA cores, while the Tesla C2070 is designed for high-performance computing and contains 448 CUDA cores. In addition, the GTX card is very affordable when compared to the Tesla card; however, the Tesla card contains much more memory.


Table 2 shows the timings for the CPU and the two GPUs in seconds per iteration, while figure 10 shows the speed-up. The GPU code sees a maximum speed-up of over 38 times its CPU counterpart. It should be noted that for the fifth-order run, the CPU code took just over three and a half hours to complete, while the GPU code finished in a mere six minutes. The two GPUs have comparable speeds, with the Tesla overtaking the GTX as the order of accuracy increases, demonstrating the capability that lower-end cards have for numerical computations.
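These speed-ups follow directly from the ratio of the CPU and GPU times per iteration in table 2; for example,

\[
\frac{t_{\mathrm{CPU}}}{t_{\mathrm{GPU}}}\bigg|_{\text{4th order, Tesla}} = \frac{1.233750}{0.032121} \approx 38.4,
\qquad
\frac{2.345520}{0.066518} \approx 35.3 \ \text{(5th order, Tesla)}.
\]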

C. SD7003 wing

Type / Order 2nd 3rd

CPU 607.78 hours 3483.29 hours

GPU 27.68 hours 137.14 hours

Table 3. SD7003 wing timings (total time for simulation).

Figure 9. Vortex propagation speed-up.

Figure 10. Acoustic cylinder case speed-up with two GPUs.

The final case uses four Tesla cards and compares them against four CPUs with all 32 cores activated. Whereas the 32 cores would require about 145 days for a third-order solution, the GPUs complete it in less than a week, yielding a computation roughly 25 times faster. Hence we observe the usefulness of GPUs for handling large CFD problems.


Figure 11. SD7003 wing speed-up.

Again, figure 11 and table 3 show the speed-up results and the total simulation times for the CPU and GPU solvers, respectively.

VI. Conclusions and Future Work

Computing with GPUs appears to be a very promising way to decrease the computational cost and time of CFD solvers, out-performing the CPU counterpart in the present study by twenty to forty times. To achieve these results, however, a complete rewrite of the SD code was required. In addition, the size of the computations is limited by the memory on the GPU, which is only 5 to 6 gigabytes on higher-end cards. Nevertheless, with large increases in performance over CPUs, GPUs save computation time. Additionally, computations can be completed on GPU workstations as opposed to CPU servers, saving space and money without losing computational power. This alone makes GPU computing a viable solution for large-scale CFD simulation.

GPU codes such as the one presented will continue to be developed. A more recent numerical method, the Correction Procedure via Reconstruction (CPR) [4], requires fewer operations than the SD method and has recently been implemented in two dimensions for GPU CUDA computing, demonstrating excellent results. CUDA will be used to convert a three-dimensional CPR Navier-Stokes solver from CPU to GPU computing with the intent of improving speed performance by orders of magnitude.

Acknowledgments

The authors would like to acknowledge the support from the Air Force Research Laboratory at Wright-Patterson Air Force Base and the Department of Aerospace Engineering, Iowa State University.

References

[1] J. Barth & P. O. Frederickson. High-order solution of the Euler equations on unstructured grids using quadratic reconstruction. AIAA Paper 1990-0013, 1990.

[2] A. Corrigan, F. Camelli, & R. Lohner. Running unstructured grid based CFD solvers on modern graphics hardware. AIAA Paper 2009-4001, 2009.

[3] M. Delanaye & Y. Liu. Quadratic reconstruction finite volume schemes on 3D arbitrary unstructured polyhedral grids. AIAA Paper 1999-3529-CP, 1999.

[4] H. Gao & Z. J. Wang. A residual-based procedure for hp-adaptation on 2D hybrid meshes. AIAA Paper 2011-492, 2011.

[5] D. A. Jacobsen, J. C. Thibault, & I. Senocak. MPI-CUDA implementation for massively parallel incompressible flow computations on multi-GPU clusters. AIAA Paper 2010-522, 2010.

[6] D. A. Kopriva. A staggered-grid multidomain spectral method for the compressible Navier-Stokes equations. Journal of Computational Physics, 143:125-158, 1998.

[7] NVIDIA. CUDA C Best Practices Guide. Ver. 4.1.


[8] NVIDIA. NVIDIA CUDA C Programming Guide. Ver. 5.0.

[9] P. L. Roe. Approximate Riemann solvers, parameter vectors, and difference schemes. Journal of Computational Physics, 43:357-372, 1981.

[10] V. V. Rusanov. Calculation of interaction of non-steady shock waves with obstacles. Journal of Computational Physics USSR, 1:267-279, 1961.

[11] C. W. Shu. Essentially non-oscillatory and weighted essentially non-oscillatory schemes for hyperbolic conservation laws. In: A. Quarteroni (Ed.), Advanced Numerical Approximation of Nonlinear Hyperbolic Equations, Lecture Notes in Mathematics, Vol. 1697, p. 325. Springer-Verlag, Berlin/New York, 1998.

[12] J. C. Thibault & I. Senocak. CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows. AIAA Paper 2009-758, 2009.

[13] R. Vichnevetsky. Propagation through numerical mesh refinement for hyperbolic equations. Mathematics and Computers in Simulation, 23:344, 1981.

[14] M. R. Visbal & D. V. Gaitonde. High-order-accurate methods for complex unsteady subsonic flows. AIAA Journal, 37:1231-1239, 1999.

[15] M. R. Visbal & D. V. Gaitonde. High-order schemes for Navier-Stokes equations: Algorithm and implementation into FDL3DI. Technical Report AFRL-VA-WP-TR-1998-3060, Air Force Research Laboratory, Wright-Patterson AFB, 1998.

[16] M. R. Visbal & D. V. Gaitonde. Very high-order spatially implicit schemes for computational acoustics on curvilinear meshes. Journal of Computational Acoustics, 9:1259-1286, 2001.

[17] Y. Zhou & Z. J. Wang. Implicit large eddy simulation of transitional flow over a SD7003 wing using high-order spectral difference method. AIAA Paper 2010-4442, 2010.
