Vectorized Unstructured CFD Methods for GPU Computing

Gregory D. Howe1 Georgia Institute of Technology, Atlanta, GA, 30332

Vectorized formulations of common unstructured-grid CFD methods are presented. These are used to produce a Matlab-based unstructured CFD code that will run on traditional central processing units (CPUs) and graphics processing units (GPUs). GPU-based computing systems are potentially the new future of affordable high-performance scientific computing, while unstructured CFD has increasingly become the standard for aerodynamic design problems over the last twenty years. Multiple method components are examined here to establish their suitability for this type of vectorized formulation and computation. Example solutions are presented for classic geometry cases.

Nomenclature

A = flux Jacobian matrix
a = speed of sound
c = centroid (of a cell or face)
Cp = pressure coefficient = (p - p∞) / (0.5 ρ∞ V∞²)
e0 = total energy per unit volume
F = flux vector
h0 = stagnation enthalpy
k = thermal conductivity
n = unit normal vector with components [nx ny nz]
p = pressure
R = residual vector
ReL = Reynolds number = ρ∞ V∞ L / µ∞, where L is a reference length
r = displacement vector between two points
S = area (typically of a cell face)
U = conserved state vector = [ρ ρu ρv ρw e0]^T
u,v,w = velocity x-, y-, and z-components
V = velocity vector = [u v w]^T
x = coordinate vector of a point in space (typically a grid node)
γ = ratio of specific heats (1.4 for air at moderate temperatures)
κi = number of faces of cell i
κ(i) = function returning face indices corresponding to cell i
η(j) = direction of normal vector nj attached to face j
ϕj = number of nodes of face j
ϕ(j) = function returning node indices corresponding to face j
ζk = number of cells which reference node k
ζ(k) = function returning cell indices corresponding to cells referencing node k
ξ(i) = function returning node indices corresponding to nodes defining cell i
ψ(i,k) = function returning face indices corresponding to faces referencing node k and cell i
ρ = density
µ = dynamic viscosity
Λ = flux Jacobian spectral radius
σ = Courant-Friedrichs-Lewy (CFL) number
Ψ = flux limiter function
Ω = cell volume

1 Master's Student, AIAA Student Member.


Subscripts

1,2, etc. = refers to a subfunction which produces that number of outputs
c = convective (as opposed to viscous) part
i = refers to cell-centered or centroidal values
j = refers to face-centered values
k = refers to nodal values
v = viscous (as opposed to convective) part
L = "left-hand" value across a boundary
R = "right-hand" value across a boundary
⊥ = perpendicular component, typically computed via a dot product with a normal vector
∞ = denotes freestream quantities

I. Introduction

It is becoming increasingly apparent with time that the future of scientific computing lies with Graphics Processing Unit (GPU) computing technologies. In October 2010, the National University of Defense Technology in China announced the completion of Tianhe-1A, a new world's fastest supercomputer, clocking an extraordinary 2.507 petaflops (2.507x10^15 floating-point operations per second). According to information from press releases, about 70% of this computer's processing power is from over 7,000 NVIDIA Tesla GPUs. NVIDIA claims this architecture to be three times as power efficient and twice as space efficient as a CPU-only architecture with the same performance. Numerous computer manufacturers have begun selling individual workstations containing four of the latest generation of Tesla GPU cards (commonly branded as "personal supercomputers"). These tiny clusters can potentially run at around 5 teraflops - equivalent to the speed of the fastest supercomputer in the world as recently as 2000. Reproducing this performance with CPUs would require on the order of 100 of Intel's latest processors - far too many to fit in a single workstation.

The drawback to the nature of GPU computing is that utilizing all of this potential requires extremely well-vectorized code. This is because GPUs consist of numerous small processing units that perform arithmetic operations extremely quickly but cannot be effectively controlled independently. What this effectively means to the programmer is that fast, efficient GPU code needs to contain as few loops and complex logical operations as possible and perform arithmetic operations on whole arrays of data at once, instead of element-by-element.

Unstructured CFD codes have increasingly been the preference of aerodynamic designers over the last twenty years due to the much-improved ease of grid creation over older structured methods. Complex 3-D geometries can be orders of magnitude faster to discretize with unstructured grids. However, unstructured codes are less obviously suited to the kind of vector processing done by GPUs. This is largely due to the amount of indexing and looping over unpredictable numbers of elements that is typically involved. This current work is an effort to control this problem and write an unstructured code in a vector-efficient manner.

II. Vectorization and GPU Programming

GPUs are fundamentally vector processors. Modern GPUs consist of potentially hundreds of small, cheap "stream processors." These were originally implemented in order to allow GPUs to process many independent graphical elements in parallel as a way to speed up generation of complex graphical effects. Programming APIs like NVIDIA's CUDA allow direct access to this processing power from within more general C programs in order to very quickly perform identical operations on elements of huge arrays.

The fundamental principle of writing vectorized code is to minimize control flow operations wherever possible so that every element of a large array is performing the same action. Control flow operations include anything which requires the computer to make a decision about whether to jump to a different line of code or not. This obviously includes looping commands ("for," "while," or "do" loops) but can also extend to many types of "if-then" statements.

A. Use of Matlab as a Development Environment

Many CFD programmers (and indeed many C and Fortran programmers in general) would deride Matlab as simply too slow a programming language to be useful for writing a CFD solver. This is not a wholly false statement. Matlab is an interpreted (as opposed to compiled) language, essentially meaning that source code is converted to machine code at execution time rather than beforehand. This does inherently slow the computation somewhat, especially with complex control statements. However, Matlab does an extremely good job of using precompiled, optimized libraries to do individual vectorized operations. This means that if programs are written in a completely


vectorized fashion, they can come close to the maximum level of performance obtainable with a lower-level programming language. The control flow around a mathematical operation may be slower than in another language, but the mathematical operation itself should be very nearly as fast as possible in any other setting. Matlab also has advantages in terms of ease of programming that generally result from its status as a higher-level programming language than a language like C or Fortran.

All of this combines to make Matlab the ideal development environment for vectorized algorithms. The language treats all variables as arrays and does virtually all calculations as vector operations when instructed properly. Even the fact that iterative scalar algorithms are slow in Matlab can be seen as an advantage: it means that Matlab code optimized for runtime will already be close to optimal in a vectorized hardware setting such as a GPU.

B. GPU Programming through Matlab

Other researchers in scientific computing have noticed the potential parallels between the vectorization of Matlab and that of a GPU programming language. To this end, several different groups have been working on Matlab interfaces to CUDA, NVIDIA's GPU programming language. These include the Parallel Computing Toolbox from Mathworks itself1, Jacket by Accelereyes2, a third-party company, and GPUmat3, an open-source project. In all of these implementations, Matlab's object-oriented framework is used to create new data types that define "GPU arrays." These arrays are stored on the GPU and feature overloaded mathematical functions so that standard operations performed on these arrays are performed on the GPU. This makes the differences between programming in standard Matlab and CUDA-accelerated Matlab nearly invisible in many cases. Well-vectorized Matlab code can be translated to code running on the GPU in a small fraction of the amount of time it takes to alter C code to make use of the CUDA libraries. (For reference, the procedure of translating existing Fortran code without creation of some very complex libraries would really have to start with translating the code to C, as CUDA is really only designed to interface with a C environment.)

The true beauty of this environment within Matlab is that it makes nearly all of the management of the GPU itself and the transfer of data between GPU and system memory nearly transparent to the programmer. There are a few programming oddities that work slightly differently on the GPU, but making a "rough draft" program that utilizes the GPU takes almost no additional effort and even less knowledge about how the GPU works, provided that the original Matlab code was well vectorized. Subtleties such as the few operations that would actually run faster on a CPU take a bit longer to learn and optimize, but these are generally more minor performance improvements.
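As a concrete illustration of the "GPU array" idea described above, the following minimal sketch uses the gpuArray type from the Mathworks Parallel Computing Toolbox mentioned earlier. The solver in this work was actually built on Jacket, whose data types differ in name but are used in essentially the same way; the array sizes and variable names here are purely illustrative.

```matlab
% Minimal sketch of GPU-array usage (Parallel Computing Toolbox style).
% The project itself used Jacket; names and sizes here are illustrative only.
rho = rand(1e6, 1);        % cell-centered densities on the host (dummy data)
u   = rand(1e6, 1);        % cell-centered x-velocities (dummy data)

rho_g = gpuArray(rho);     % copy the arrays to GPU memory once
u_g   = gpuArray(u);

mom_g = rho_g .* u_g;      % overloaded operator: element-wise multiply on the GPU

mom = gather(mom_g);       % bring the result back to host memory when needed
```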

All of the GPU code that was developed for the project was developed in Jacket. Accelereyes' product is considerably ahead of the other related projects in the amount of Matlab functionality that it supports. (Additionally, Accelereyes offers a "student license" option that made obtaining a valid license for the product much more affordable for a graduate student researcher.)

C. Vectorized Arithmetic and Functions

The key concept in a vectorized program is that there should be as few loops (for/do/while/etc) as possible. In a program like a CFD solver, this ideally means that the only loop in the code should be the steps of the time integration scheme. Some integration schemes, such as a Runge-Kutta scheme, will require addition of an extra loop or two at this sort of large level to deal with subiterations. A key feature then is that all mathematical operations need to operate on entire arrays at once without the need for looping.

Vector operators, in the most general description, are operators that take in arrays of arbitrary size, perform some operation on each element, and then return an array of the same size. The simplest examples of this are single-input functions (e.g. sine, cosine, or the exponential function). These functions will take in a single array and return an array of the same size. The other common example is an operator that takes in two or more arrays of identical size and uses the elements of these arrays in a one-to-one correspondence to produce an output array of the same size. This is the case with the arithmetic operators (+, -, *, /).

D. Vectorized Indexing and Logical Vectors

An important feature employed for the vectorization of a whole code such as a CFD solver is the idea of vectorized indexing. This is the ability to take an array, index it by an integer array of potentially wildly different size, and produce an output array of the size of the index array. For example, imagine a 10x1 vector Y. A second vector Z with only the even-index terms can be obtained easily with the vector indexing expression Y([2 4 6 8 10]^T). This type of indexing is the essence of how the complex connectivity of an unstructured grid can be dealt with in a vectorized fashion.


This idea of vectorized indexing is familiar in the Matlab environment but may seem foreign to most C or Fortran programmers. However, it is especially important in a GPU programming environment. With the way CUDA is implemented on a GPU, many memory accesses can actually be "coalesced" into a much smaller number of instructions to the device's memory controller. In order for this to occur, however, all of these memory access commands must be issued at the same time. The benefits of coalescing memory accesses vary heavily with the access pattern involved, but it is an active area where NVIDIA is focusing improvements in the CUDA architecture.

A common way to prevent the use of "if-then" control statements is by vectorized indexing with logical vectors. An array of "true" and "false" values is created with a vectorized Boolean expression. This array is then used to index another vector (of the same size) to return only the values which correspond to "true" indices. For example, rather than looping through all of the faces of an unstructured mesh looking for which ones have a far-field boundary condition, a programmer can use a vectorized command to create a logical vector which can then be used to index the overall vector of faces to produce the vector of just far-field faces, as sketched below.
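The following minimal sketch shows that pattern under assumed variable names (faceBC, FARFIELD, p_face are hypothetical placeholders for the stored boundary-condition codes and face data):

```matlab
% Sketch of logical indexing: select far-field faces without any loop.
% faceBC is an nFaces-by-1 array of boundary-condition codes and FARFIELD is
% the (hypothetical) code assigned to far-field faces.
isFar    = (faceBC == FARFIELD);   % logical vector built by one vectorized test
farFaces = find(isFar);            % integer indices of the far-field faces

% Any face-centered array can now be restricted to those faces directly:
pFar = p_face(isFar);              % pressures on far-field faces only
```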

E. Less-Vectorized Functions

There are a few additional operations which can be performed in a more vector-friendly format even if they are not precisely fully vectorized.

One example of this is arithmetic operations on two arrays of different sizes. This can be allowed if one of the arrays has a size of one in every dimension in which the two arrays differ in size. The smaller array is essentially "duplicated" along each such dimension to perform the operation. The benefit over actual duplication is that additional memory is not needed to store redundant information. For example, imagine the computation of face area vectors from unit normal vectors and scalar face areas:

$$\mathbf{S}_j = S_j\,\mathbf{n}_j \qquad (1)$$

The inputs Sj and nj are arrays with the same number of rows (the total number of faces), but Sj has only one column, whereas nj has three columns. Applying the distributive property of multiplication is perfectly natural here to produce the three-column output array Sj. (This is accomplished within Matlab via the built-in "bsxfun" function).
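A minimal sketch of that bsxfun call is shown below; n_j and S_j are hypothetical names for the stored unit normals and scalar face areas.

```matlab
% Sketch of Eq. (1): scale each unit normal by its scalar face area.
% n_j is nFaces-by-3, S_j is nFaces-by-1; the singleton second dimension of
% S_j is implicitly expanded by bsxfun, so no looping or duplication occurs.
Svec_j = bsxfun(@times, n_j, S_j);   % nFaces-by-3 array of face area vectors
```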

Another example of vector-friendly functions are operations which shrink one dimension of an array to unity. The most common examples of this are "sum" and "product" functions, but other instances are possible. In programming syntax, these functions will all take in a dimension on which to operate in addition to their array input. For example, imagine the computation of the normal component of a velocity vector:

$$V_{\perp,j} = u\,n_x + v\,n_y + w\,n_z \qquad (2)$$

If the velocity and normal vectors were stored as the n-by-3 arrays Vj and nj, this computation could be done in Matlab with the command "sum(Vj .* nj , 2)". An element-by-element multiplication takes place between the two input arrays and then the product is summed along its second dimension, producing an n-by-1 array of normal velocity components. The sum is not a completely vectorized operation, but for large arrays it will run far faster on a vector processor than looping through the array.

III. Unstructured Grid Geometry

In general, the computation of most grid-based information is not particularly relevant to the topic of fast vectorized computation because it only has to be evaluated once at the beginning of a solver run. These types of data include parameters like cell centroids and volumes and face areas and normal vectors. The items that are discussed here are more unique to this vectorized formulation.

A. Terminology

Throughout this work, a standard set of terms will be used to refer to the geometrical elements of an unstructured grid, whether it be two- or three-dimensional. This terminology is fairly standard for 3-D grids, but it is important to clarify here for consistency in the 2-D case.

1. Cells

The term cell is always used to refer to the highest-dimensional geometrical construct in the grid. In 3-D this corresponds to polyhedron whereas in 2-D this corresponds to polygons. Cells are always said to have volume, even


if they are two-dimensional (in which case this "volume" is actually surface area). Throughout this work, cells are referenced by the subscript index i. The variable Ωi is used to refer to the volume of the cell i.

2. Faces

The term face is always used to refer to the components which join together to create a cell. In 3-D these are polygons while in 2-D these are line segments. Faces are always shared by at most two cells (boundary faces are referenced by only one cell). Faces are always said to have surface area, even if they are one-dimensional (in which case this "surface area" is actually length). Faces also always have definable normal vectors to describe their orientation in space. Throughout this work, faces are referenced by the subscript index j. This index varies from 1 to the total number of faces in the entire grid (i.e. it is not redefined for each cell). The variable Sj is used to refer to the surface area of the face j.

3. Nodes

The term node is always used to refer to an (x, y, z) point in space. A group of nodes (in a particular order) defines a face or cell. Nodes may be a part of any number of cells or faces. Throughout this work, nodes are referenced by the index k. This index varies from 1 to the total number of nodes in the entire grid (i.e. it is not redefined for each cell or face). The variable xk is used to refer to the coordinates of the node k.

4. Edges

In the event that the line segments of a 3-D mesh must be referenced, they are referred to as edges. Care should be taken, however, not to confuse them with the faces of a 2-D grid. While the two are geometrically the same, they function quite differently within a flow solver. Edges would be relevant in a node-centered scheme, but as the scheme described here is cell-centered, they are not explicitly needed.

B. Connectivity Functions

A series of functions is identified here to define the way that the different elements are constructed from each other. All of these are shown as functions which take in an integer index and return a vector (potentially of variable length) of integer indices. In coding practice, however, all of these "functions" are simple table lookups into matrices of integers. In typical usage, the row number of the matrix is the input ("independent") variable and the values on that row are the output ("dependent") variables. Appendix A shows a small sample grid and the example connectivity matrices for that grid.

1. Cell-to-Node Connectivity: ξ(i)

The function ξ(i) defines the connectivity of cells to nodes. The outputs of ξ(i) have variable, but well-defined length. For example, ξ(i) will always return three indices for a triangular cell, four indices for a quadrilateral or tetrahedral cell, etc. The cell indices i are typically arranged such that the length of ξ(i) is monotonically increasing as i increases. That is, all triangular cells are referenced before quadrilateral cells in 2-D or all tetrahedral cells are referenced before pyramidal cells in 3-D. In the event of multiple cell geometries with the same number of nodes (such as a triangular prism and a pentagonal pyramid, each with six nodes), the cells with the smallest number of faces are referenced first (e.g. the prism with five faces before the pyramid with six). This allows for the connectivity information to be broken up into multiple subfunctions without the need to store additional indexing information. For example, in 2-D:

$$\xi(i) = \begin{cases} \xi_3(i), & i \le N_T \\ \xi_4(i), & N_T < i \le N_T + N_Q \end{cases} \qquad (3)$$

Where NT and NQ are the total number of triangular and quadrilateral elements, respectively. Thus for this grid topology (consisting of only triangular and quadrilateral cells), ξ(i) is stored as two matrices: one NT-by-3 and one NQ-by-4.

This connectivity function is typically assumed to be the only one that is known beforehand, as it can be used to completely define a computational mesh from a cloud of nodes.

2. Face-to-Node Connectivity: ϕ(j)

The function ϕ(j) defines the connectivity of faces to nodes. In a 2-D grid, ϕ(j) has length 2 for all j. On a 3-D grid, there may be more variation, but even on a grid defined with tetrahedra, quadrilateral pyramids, triangular prisms, and hexahedra, ϕ(j) will only consist of two subfunctions: ϕ3(j) and ϕ4(j).


ϕ(j) is essentially created by the reorganization and sorting of ξ(i). A large matrix is created (for each subfunction of ϕ(j)) by concatenating each face implicitly defined by ξ(i) such that the rows of this matrix are the indices of the nodes in faces. Each of these rows is sorted in an order-preserving fashion (to avoid altering the topology of polygonal faces and turning a convex quadrilateral into a self-intersecting one) such that the faces that are referenced multiple times are guaranteed to appear identically each time. The rows of the matrix are then sorted (with a typical priority of first column, then second column, etc.). This places identical rows adjacent to one another so the duplicates can be removed.
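A minimal sketch of this sort-and-deduplicate construction is given below for a grid of triangular cells only, where every face has exactly two nodes so that a plain row sort is already order-preserving; xi_tri and the other names are hypothetical, and polygonal 3-D faces would instead need the order-preserving sort described above.

```matlab
% Minimal sketch: build the face-to-node table phi(j) for a triangular grid.
% xi_tri is NT-by-3 and lists the three node indices of each triangle.
rawFaces = [xi_tri(:, [1 2]);          % each triangle contributes three faces
            xi_tri(:, [2 3]);
            xi_tri(:, [3 1])];

% Sort within each row so a shared face looks identical from both cells, then
% keep one copy of every distinct face. Internal faces appear twice in
% rawFaces; boundary faces appear only once.
sortedFaces = sort(rawFaces, 2);
phi = unique(sortedFaces, 'rows');     % nFaces-by-2 face-to-node connectivity
```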

3. Cell-to-Face Connectivity: κ(i) and κ-1(j)

The function κ(i) defines the connectivity of cells to faces. The lengths of the outputs of κ(i) depend on the topology of the cells defined by ξ(i). If the cell indices are properly ordered, this will also be a monotonically-increasing function that can also be deterministically broken up into subfunctions, thus eliminating the need for an additional mapping function. Note that the inverse function κ-1(j) always has either one (for boundary faces) or two (for internal faces) outputs because faces always split at most two cells.

Computing κ(i) is a somewhat complex process, but it is not overly time-consuming and need only be done once for each grid. First, the inverse function κ-1(j) is created during the creation of ϕ(j). When the rows of ϕ(j) are sorted (before duplicate faces are removed), the change in the original indices is tracked. Each pair of identical rows of this intermediate ϕ(j) matrix came from the two cells that share the face j. Because not much memory is wasted by doing so, boundary faces are typically handled by setting the second one equal to some non-indicial value such as 0 or "NaN" to preserve the rectangular shape of the array. The inversion κ-1(j) to produce κ(i) is performed by a similar sorting algorithm to the formation of ϕ(j). Each value of κ-1(j) is the row number of a row of κ(i) containing the corresponding value of j. Optionally, κ(i) can be stored in "signed" form, where the index has a negative value if it came from the second column of κ-1(j). This means that in addition to knowing which faces are connected to a cell, it is known whether the cell was the "right-hand" or "left-hand" cell adjacent to that face.

C. Normal Vectors and Facing

Numerous equations in an unstructured formulation require the use of a face normal vector. Standard practice defines these normal vectors as pointing out of the cell. However, it is desired to separate computations on faces from either adjacent cell as being the "current" cell. This allows all edge-based computations to be done once and then this resultant value be applied to both adjacent cells as appropriate. For this reason, a function η(j) is defined having the same shape as κ-1(j). η(j) has a value of 1 when nj points out of the corresponding cell and a value of -1 when nj points into the corresponding cell. Thus, the product η(j)nj always points out of the cell. η(j) should return a value of 0 or NaN in the same places as κ-1(j) does.

One method of computing the function η(j) is by taking the sign of the dot product of the normal vector and the vector between the cell and face centroids. (Note that this formulation is only strictly valid for convex cells):

$$\eta(j) = \frac{\left(\mathbf{c}_j - \mathbf{c}_{\kappa^{-1}(j)}\right)\cdot\mathbf{n}_j}{\left|\left(\mathbf{c}_j - \mathbf{c}_{\kappa^{-1}(j)}\right)\cdot\mathbf{n}_j\right|} \qquad (4)$$

This function can also be reshaped as the inversion of κ-1(j) to κ(i) creating η-1(i). This function returns values of 1 and -1 recording the facing of the normal vectors of all of the faces of cell i.
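A minimal sketch of Eq. (4) in vectorized form is shown below for the first adjacent cell of every face; the array names (c_face, c_cell, n_face, kinv1) are hypothetical placeholders for the stored centroid, normal, and κ⁻¹(j) data.

```matlab
% Sketch of Eq. (4), evaluated for every face at once.
% c_face, n_face: nFaces-by-3 face centroids and unit normals.
% c_cell: nCells-by-3 cell centroids; kinv1: first column of kappa^-1(j).
r    = c_face - c_cell(kinv1, :);     % vector from cell centroid to face centroid
eta1 = sign(sum(r .* n_face, 2));     % +1 if n_j points out of that cell, else -1
```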

IV. Governing Equations

The following section describes the cell-centered scheme evaluated in this paper.

A. Finite-Volume Euler Equations

As with most unstructured codes, a finite-volume representation of the Euler equations is used. The general formulation and form of the convective flux vector here is similar to that used by Frink4. These are given below in vector integral form over a volume Ω with a surface ∂Ω:

$$\frac{\partial}{\partial t}\int_{\Omega}\mathbf{U}\,d\Omega + \oint_{\partial\Omega}\left[\mathbf{F}_c(\mathbf{U}) - \mathbf{F}_v(\mathbf{U})\right]dS = 0 \qquad (5)$$


Where the state vector is given as:

$$\mathbf{U} = \begin{bmatrix}\rho & \rho u & \rho v & \rho w & e_0\end{bmatrix}^T \qquad (6)$$

The convective flux term is defined at a face as4:

$$\mathbf{F}_c(\mathbf{U})_j = \left(\mathbf{V}\cdot\mathbf{n}\right)_j\begin{bmatrix}\rho\\ \rho u\\ \rho v\\ \rho w\\ e_0 + p\end{bmatrix}_j + p_j\begin{bmatrix}0\\ n_x\\ n_y\\ n_z\\ 0\end{bmatrix}_j \qquad (7)$$

The viscous flux term is defined at a face as5:

$$\mathbf{F}_v(\mathbf{U})_j = \begin{bmatrix}0\\ n_x\tau_{xx} + n_y\tau_{xy} + n_z\tau_{xz}\\ n_x\tau_{yx} + n_y\tau_{yy} + n_z\tau_{yz}\\ n_x\tau_{zx} + n_y\tau_{zy} + n_z\tau_{zz}\\ n_x\Theta_x + n_y\Theta_y + n_z\Theta_z\end{bmatrix}_j \qquad (8)$$

Where:

$$\tau_{xx} = \frac{\mu}{Re_L}\left(2\frac{\partial u}{\partial x} - \frac{2}{3}\nabla\cdot\mathbf{V}\right), \quad \tau_{yy} = \frac{\mu}{Re_L}\left(2\frac{\partial v}{\partial y} - \frac{2}{3}\nabla\cdot\mathbf{V}\right), \quad \tau_{zz} = \frac{\mu}{Re_L}\left(2\frac{\partial w}{\partial z} - \frac{2}{3}\nabla\cdot\mathbf{V}\right)$$

$$\tau_{xy} = \tau_{yx} = \frac{\mu}{Re_L}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right), \quad \tau_{xz} = \tau_{zx} = \frac{\mu}{Re_L}\left(\frac{\partial u}{\partial z} + \frac{\partial w}{\partial x}\right), \quad \tau_{yz} = \tau_{zy} = \frac{\mu}{Re_L}\left(\frac{\partial v}{\partial z} + \frac{\partial w}{\partial y}\right) \qquad (9)$$

The Θ terms are used to represent the work done by the combination of viscous stresses and heat conduction:

$$\begin{aligned}\Theta_x &= u\tau_{xx} + v\tau_{xy} + w\tau_{xz} + \frac{k}{Re_L}\frac{\partial T}{\partial x}\\ \Theta_y &= u\tau_{yx} + v\tau_{yy} + w\tau_{yz} + \frac{k}{Re_L}\frac{\partial T}{\partial y}\\ \Theta_z &= u\tau_{zx} + v\tau_{zy} + w\tau_{zz} + \frac{k}{Re_L}\frac{\partial T}{\partial z}\end{aligned} \qquad (10)$$


Note that the unexpected factors of 1/ReL are due to the nondimensionalization strategy, discussed later in this paper.

If an ideal gas is assumed, the pressure, stagnation enthalpy, and local speed of sound can be calculated as:

$$p = \left(\gamma - 1\right)\left[e_0 - \tfrac{1}{2}\rho\left(u^2 + v^2 + w^2\right)\right] \qquad (11)$$

$$h_0 = \frac{\gamma}{\gamma - 1}\frac{p}{\rho} + \tfrac{1}{2}\left(u^2 + v^2 + w^2\right) \qquad (12)$$

$$a = \sqrt{\frac{\gamma p}{\rho}} \qquad (13)$$

Equation (5) is discretized by assuming a uniform value of U within a cell and uniform values of F over each of the surfaces of the cell:

$$\frac{\partial\mathbf{U}_i}{\partial t} = -\frac{1}{\Omega_i}\sum_{j=\kappa(i)}\left[\left(\mathbf{F}_{c,j} - \mathbf{F}_{v,j}\right)S_j\right] = -\frac{1}{\Omega_i}\mathbf{R}_i \qquad (14)$$

Where the indexing function κ(i) returns the face indices corresponding to the cell i. The fluxes across each face should be calculated in a face-by-face fashion rather than a cell-by-cell fashion in order to only compute each flux vector once. The summation of fluxes is generalized as a residual vector Ri for simplicity in later expressions. Note that this expression can be considered fully vectorized if the face flux vectors are reshaped with a vectorized index by κ(i) and then the sum over these faces is performed in a single operation. The presence of cells with differing numbers of faces will generally require a loop over the different topologies.
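As an illustration of this face-to-cell gather-and-sum, the sketch below evaluates the summation of Eq. (14) for a block of triangular cells only; flux, kappa_tri, and the other names are hypothetical, the flux array is assumed to already include the face area S_j, and the signed form of κ(i) described in Section III supplies the facing η(j). The loop over cell topologies mentioned above would wrap this block.

```matlab
% Sketch of the flux summation in Eq. (14) for triangular cells only.
% flux:      nFaces-by-5 array of (F_c - F_v)*S_j for every face.
% kappa_tri: NT-by-3 signed face indices; the sign stores the facing eta(j).
sgn = sign(kappa_tri);                       % +1/-1 facing of each face
idx = abs(kappa_tri);                        % face indices proper

R_tri = zeros(size(kappa_tri, 1), 5);
for m = 1:5                                  % short loop over the 5 equations only
    Fm = flux(:, m);                         % one flux component on all faces
    R_tri(:, m) = sum(sgn .* Fm(idx), 2);    % gather by kappa(i), sum over faces
end
```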

B. Convective Flux Discretization - Roe Flux Difference Splitting

The current convective flux discretization method implemented is Roe's approximate Riemann solver6. Again, the particulars of the equations used here are generally borrowed from Frink4. The convective flux across a cell face j is expressed as:

$$\mathbf{F}_{c,j} = \frac{1}{2}\left[\mathbf{F}\left(\mathbf{U}_L\right) + \mathbf{F}\left(\mathbf{U}_R\right) - \left|\hat{\mathbf{A}}\right|\left(\mathbf{U}_R - \mathbf{U}_L\right)\right]_j \qquad (15)$$

Where UL and UR represent the state vector on the "left" and "right" sides of the face j. Typically this corresponds to flux coming from the cells referred to by the first and second columns, respectively, of the

connectivity function κ-1(j). The matrix A is the flux Jacobian matrix computed with the following "Roe-averaged" quantities:

$$\hat{\rho} = \sqrt{\rho_L\rho_R} \qquad (16)$$

$$\hat{u} = \frac{u_L + u_R\sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}} \qquad (17)$$

$$\hat{v} = \frac{v_L + v_R\sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}} \qquad (18)$$

$$\hat{w} = \frac{w_L + w_R\sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}} \qquad (19)$$

$$\hat{h}_0 = \frac{h_{0,L} + h_{0,R}\sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}} \qquad (20)$$

$$\hat{a}^2 = \left(\gamma - 1\right)\left[\hat{h}_0 - \tfrac{1}{2}\left(\hat{u}^2 + \hat{v}^2 + \hat{w}^2\right)\right] \qquad (21)$$

Note that the computation of these various parameters is greatly accelerated by first computing the value of $\sqrt{\rho_R/\rho_L}$. Through use of diagonalizing matrices and eigenvalues, the "artificial dissipation" term introduced in Equation 15 can be computed as:

$$\left|\hat{\mathbf{A}}\right|\left(\mathbf{U}_R - \mathbf{U}_L\right) = \Delta\hat{\mathbf{F}}_1 + \Delta\hat{\mathbf{F}}_4 + \Delta\hat{\mathbf{F}}_5 \qquad (22)$$

where:

$$\Delta\hat{\mathbf{F}}_1 = \left|\hat{V}_\perp\right|\left\{\left(\Delta\rho - \frac{\Delta p}{\hat{a}^2}\right)\begin{bmatrix}1\\ \hat{u}\\ \hat{v}\\ \hat{w}\\ \tfrac{1}{2}\left(\hat{u}^2 + \hat{v}^2 + \hat{w}^2\right)\end{bmatrix} + \hat{\rho}\begin{bmatrix}0\\ \Delta u - n_x\Delta V_\perp\\ \Delta v - n_y\Delta V_\perp\\ \Delta w - n_z\Delta V_\perp\\ \hat{u}\Delta u + \hat{v}\Delta v + \hat{w}\Delta w - \hat{V}_\perp\Delta V_\perp\end{bmatrix}\right\} \qquad (23)$$

$$\Delta\hat{\mathbf{F}}_{4,5} = \left|\hat{V}_\perp \pm \hat{a}\right|\left(\frac{\Delta p \pm \hat{\rho}\hat{a}\Delta V_\perp}{2\hat{a}^2}\right)\begin{bmatrix}1\\ \hat{u} \pm n_x\hat{a}\\ \hat{v} \pm n_y\hat{a}\\ \hat{w} \pm n_z\hat{a}\\ \hat{h}_0 \pm \hat{V}_\perp\hat{a}\end{bmatrix} \qquad (24)$$

Where $\hat{V}_\perp = \hat{u}\,n_x + \hat{v}\,n_y + \hat{w}\,n_z$ and all Δ-values are differences across the cell face, computed as Δp = η(j)[pR - pL], except $\Delta V_\perp = n_x\Delta u + n_y\Delta v + n_z\Delta w$. The inclusion of the normal vector facing function η(j) is necessary for consistency. Note that the entirety of this procedure is completely vectorized if values of UR and UL are given; a sketch of the Roe-averaging step is given below. The above is sufficient to describe a first-order-accurate flux treatment where UR and UL are made equal to the cell-centered values on either side of the face.
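For example, the Roe-averaged quantities of Eqs. (16)-(21) can be evaluated for every face in a handful of array operations; the sketch below assumes nFaces-by-1 arrays of left and right states (names hypothetical) and reuses the square-root density ratio as suggested above.

```matlab
% Sketch of Eqs. (16)-(21): Roe averages for all faces at once.
gam = 1.4;                                 % ratio of specific heats
rr  = sqrt(rhoR ./ rhoL);                  % sqrt(rho_R/rho_L), computed once

rhoHat = sqrt(rhoL .* rhoR);               % Eq. (16)
uHat   = (uL  + uR .*rr) ./ (1 + rr);      % Eq. (17)
vHat   = (vL  + vR .*rr) ./ (1 + rr);      % Eq. (18)
wHat   = (wL  + wR .*rr) ./ (1 + rr);      % Eq. (19)
h0Hat  = (h0L + h0R.*rr) ./ (1 + rr);      % Eq. (20)
aHat   = sqrt((gam - 1).*(h0Hat - 0.5*(uHat.^2 + vHat.^2 + wHat.^2)));  % Eq. (21)
```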

C. Higher-Order State Reconstruction

A higher-order formulation for UR and UL can be obtained by assuming that the state variables vary linearly over each cell (that is, their gradient is constant within the cell). This method is essentially a generalization to mixed-element meshes of the method used by Frink and Pirzadeh7 to discretize triangular and tetrahedral elements only. A piecewise-linear reconstruction method can create what amounts to a first-order Taylor series approximation to find the values of the state vector at the face:

$$\mathbf{U}_j = \left[\mathbf{U}_i + \Psi_i\left(\nabla\mathbf{U}_i\cdot\mathbf{r}_{ij}\right)\right]_{i=\kappa^{-1}(j)} \qquad (25)$$


Where Ψi describes some flux-limiter function and rij is the vector from the centroid of cell i to the centroid of face j. Note that this is set up here so that two different values will be calculated for each face: that based on the "left-hand" neighbor cell and that based on the "right-hand" neighbor. This is consistent with the formulation of flux-splitting methods that will use these two value to compute a flux vector for the face. The only difficulty lies

with computing the value of ∇U. This is done here using the divergence theorem of vector calculus:

$$\int_{\Omega}\nabla\mathbf{U}\,d\Omega = \oint_{\partial\Omega}\mathbf{U}\,\mathbf{n}\,dS \qquad (26)$$

If it is assumed that U varies linearly over Ω (i.e. ∇U is constant within the cell) and that the surface of the volume is described by flat faces:

$$\nabla\mathbf{U} \approx \frac{1}{\Omega}\sum_{j=\kappa(i)}\mathbf{U}_j\,\mathbf{n}_j\,S_j \qquad (27)$$

For line-segment or triangular faces, the face-centered state vectors Uj are calculated as the mean of the nodal state vectors Uk (for higher-order polygonal faces, an inverse-distance-weighting scheme based on the face centroid and the node location is necessary):

$$\mathbf{U}_j = \frac{1}{\phi_j}\sum_{k=\phi(j)}\mathbf{U}_k \qquad (28)$$

While the nodal values are in turn interpolated from the cell-centered values based on an inverse-distance-weighting scheme:

$$\mathbf{U}_k = \left.\left(\sum_{i=\zeta(k)}\mathbf{U}_i/r_{ik}\right)\right/\left(\sum_{i=\zeta(k)}1/r_{ik}\right) \qquad (29)$$

Where rik is the distance between the centroid of cell i and the node k. Notice that these weights can be precomputed using knowledge of the grid alone, making this interpolation process a simple multiply-and-sum. The gradient can then be computed:

$$\nabla\mathbf{U} \approx \frac{1}{\Omega_i}\sum_{j=\kappa(i)}\left[\frac{1}{\phi_j}\sum_{k=\phi(j)}\mathbf{U}_k\right]\mathbf{n}_j\,S_j \qquad (30)$$

Notice that the above equation can be factored, essentially switching the order of the two summations. This will allow the inner summation to be dependent only upon grid information and mean that each Uk is only referenced once per cell:

$$\nabla\mathbf{U} \approx \frac{1}{\Omega_i}\sum_{k=\xi(i)}\mathbf{U}_k\left[\sum_{j=\psi(i,k)}\frac{\eta(j)\,\mathbf{n}_j\,S_j}{\phi_j}\right] \qquad (31)$$

Where the connectivity function ψ(i,k) returns the indices of the faces which are connected to node k and a part of cell i. Notice that the summation inside the brackets is completely independent of solution information and can be computed before the solver starts.

A three-step process develops: interpolation to nodal values with Eq. (29), computation of gradients with Eq. (31), and extrapolation to face values with Eq. (25). The nodal interpolation step is sketched below.
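This sketch of Eq. (29) assumes that the inverse-distance weights have been precomputed and normalized into an nNodes-by-maxCells array W, with the matching cell indices in zeta; padded entries (for nodes touched by fewer than maxCells cells) carry a weight of zero and a harmless dummy index. All names are hypothetical.

```matlab
% Sketch of Eq. (29) with precomputed, normalized inverse-distance weights.
% zeta: nNodes-by-maxCells cell indices per node (padded with dummy index 1).
% W:    nNodes-by-maxCells normalized weights (zero in the padded positions).
Uk = zeros(nNodes, 5);
for m = 1:5                                % short loop over the 5 state variables
    Um = U(:, m);                          % cell-centered values, nCells-by-1
    Uk(:, m) = sum(W .* Um(zeta), 2);      % gather by zeta(k), weight, and sum
end
```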


It is also worthwhile to note that this procedure for generating node-centered data and cell-centered gradients is not specific to state vectors and can be used for any arbitrary piece of information stored at the cell centroids of an unstructured mesh.

D. Viscous Flux Discretization

Because the viscous flux terms are always elliptic, their evaluation is actually much simpler than that of the convective fluxes. The state vectors of two adjacent cells can simply be averaged to find their value at the connecting face. However, the viscous fluxes also require velocity and temperature gradients to be evaluated at the cell faces. This can be accomplished by the simple averaging of the cell-centered gradients computed as in Eq. (31). (Note that here the variable U is used to represent any generic flow variable):

$$\overline{\nabla U}_j = \frac{1}{2}\sum_{i=\kappa^{-1}(j)}\nabla U_i \qquad (32)$$

However, this approach can lead to aberrant behavior in some cases, so the following procedure is used, adapted from Blazek5. First, the derivative of the variable is calculated in the direction along the vector rRL from one centroid to the other by using a simple first-order finite difference. Then a modified average can be calculated:

$$\nabla U_j = \overline{\nabla U}_j - \left[\overline{\nabla U}_j\cdot\frac{\mathbf{r}_{RL}}{\left|\mathbf{r}_{RL}\right|} - \frac{\partial U}{\partial l}\right]\frac{\mathbf{r}_{RL}}{\left|\mathbf{r}_{RL}\right|} \qquad (33)$$

This process is repeated separately for each of the three velocity components and temperature. Once this information is known, the viscous fluxes at the face centroids can easily be calculated.

E. Boundary Conditions

In the current implementation, all boundary conditions are enforced by setting the flux into the boundary cell across the boundary face rather than by forcibly setting the state vector values after an iteration of time integration. These fluxes are used directly in Eq. (14) without any additional artificial dissipation terms. These fluxes are set by assuming that the value of the state vector is known accurately at the face (from the boundary conditions) and thus the Roe FDS process is not involved with these fluxes.

1. Inviscid Solid Surface Boundary Condition

Inviscid surfaces are typified by the flow tangency boundary condition, namely that V · n = 0. Correspondingly, the convective flux across an inviscid surface can be directly obtained from Eq. (7) as:

$$\mathbf{F}_c(\mathbf{U})_j = \begin{bmatrix}0\\ p_w n_x\\ p_w n_y\\ p_w n_z\\ 0\end{bmatrix}_j \qquad (34)$$

Where the wall pressure pw is obtained from the cell-centered value (note that this can be considered at least first-order accurate because of another solid-wall boundary condition, n · ∇p = 0).

2. Viscous No-Slip Wall Condition

The basic viscous wall condition is the "no-slip" condition:

$$u = v = w = 0 \qquad (37)$$


This actually leaves the boundary condition for the convective flux term unchanged, as it already contained no velocity term and the pressure boundary condition is unchanged. The viscous stress terms τ remain unchanged, but the velocity gradient components must be calculated assuming that the no-slip condition has been applied. This is most easily performed by simply setting the nodal values of velocity to zero on all no-slip surfaces before the gradient calculation is performed for the entire grid. Because there is no flow through the wall, the energy flux

vector Θ simplifies to Θ = k∇T. If an adiabatic wall is desired, this temperature gradient can simply be set to zero (note that the actual usage of the vector Θ is its normal component, thus any gradient in the wall-tangent direction is irrelevant). If an isothermal wall temperature is desired, the temperature at wall nodes should be set before gradient calculation, just as with the velocity values.

3. Characteristic Far-Field Boundary Condition

This boundary condition for the far-field exterior of a computational grid is based on the concept of characteristic variables that remain constant along particular characteristic lines. This particular formulation is based upon that presented by Blazek5. "Information" (i.e. flow variables) is propagated into and out of the computational domain based on the normal Mach number and direction of flow at the face. This condition variable is computed using an outward-facing normal vector as:

$$M_{\perp,j} = \eta(j)\left[\left(u\,n_x + v\,n_y + w\,n_z\right)/a\right]_j \qquad (38)$$

This sets up four different cases where the state vector (or at least variables from which a state vector can be computed) at the face Uj is computed from the freestream state vector U∞ and the boundary cell state vector Ui (for supersonic inflow, supersonic outflow, subsonic inflow, and subsonic outflow, respectively):

$$\mathbf{U}_j = \mathbf{U}_\infty \quad \text{when } M_{\perp,j} \le -1 \qquad (39)$$

$$\mathbf{U}_j = \mathbf{U}_i \quad \text{when } M_{\perp,j} \ge 1 \qquad (40)$$

$$\begin{bmatrix}p\\ \rho\\ u\\ v\\ w\end{bmatrix}_j = \begin{bmatrix}\tfrac{1}{2}\left\{p_\infty + p_i - \rho_i a_i\,\eta(j)\left[n_x\left(u_\infty - u_i\right) + n_y\left(v_\infty - v_i\right) + n_z\left(w_\infty - w_i\right)\right]\right\}\\ \rho_\infty + \left(p_j - p_\infty\right)/a_i^2\\ u_\infty - \eta(j)\,n_x\left(p_\infty - p_j\right)/\left(\rho_i a_i\right)\\ v_\infty - \eta(j)\,n_y\left(p_\infty - p_j\right)/\left(\rho_i a_i\right)\\ w_\infty - \eta(j)\,n_z\left(p_\infty - p_j\right)/\left(\rho_i a_i\right)\end{bmatrix} \quad \text{when } -1 < M_{\perp,j} < 0 \qquad (41)$$

$$\begin{bmatrix}p\\ \rho\\ u\\ v\\ w\end{bmatrix}_j = \begin{bmatrix}p_\infty\\ \rho_i + \left(p_j - p_i\right)/a_i^2\\ u_i + \eta(j)\,n_x\left(p_i - p_j\right)/\left(\rho_i a_i\right)\\ v_i + \eta(j)\,n_y\left(p_i - p_j\right)/\left(\rho_i a_i\right)\\ w_i + \eta(j)\,n_z\left(p_i - p_j\right)/\left(\rho_i a_i\right)\end{bmatrix} \quad \text{when } 0 \le M_{\perp,j} < 1 \qquad (42)$$

All i-subscripted quantities here refer to the cell-centered value of that quantity in the boundary cell. Once the state vector at the face Uj is computed, the flux vectors Fc,j and Fv,j can be computed using Equations 7 and 8, respectively. A sketch of how the four cases are selected without per-face branching is given below.
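This sketch uses the logical-vector indexing of Section II; Mperp, Uinf, and Ucell are hypothetical placeholders, and the subsonic rows are left to the vectorized forms of Eqs. (41) and (42).

```matlab
% Sketch: pick the four far-field cases of Eqs. (39)-(42) with logical masks.
% Mperp: nFarFaces-by-1 normal Mach numbers from Eq. (38).
supIn  = (Mperp <= -1);                    % Eq. (39): supersonic inflow
supOut = (Mperp >=  1);                    % Eq. (40): supersonic outflow
subIn  = (Mperp >  -1) & (Mperp < 0);      % Eq. (41): subsonic inflow
subOut = (Mperp >=  0) & (Mperp < 1);      % Eq. (42): subsonic outflow

Uface = zeros(numel(Mperp), 5);
Uface(supIn,  :) = repmat(Uinf, nnz(supIn), 1);   % freestream state rows
Uface(supOut, :) = Ucell(supOut, :);              % interior cell state rows
% subIn and subOut rows are filled from vectorized forms of Eqs. (41)-(42).
```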

F. Explicit Time Integration Schemes

It was intended that both implicit and explicit time integration schemes would be evaluated during the course of this project. Though implicit schemes are much more common in modern CFD codes due to their superior convergence properties, they are not nearly as vectorizable as explicit schemes are. The explicit schemes presented here are fully functional in the code.


1. Explicit Runge-Kutta Method

Explicit time integration is accomplished using a Runge-Kutta scheme on the residual vectors Ri from Eq. (14). This formulation is again borrowed from Frink4. Superscripts denote the time level n, the maximum number of Runge-Kutta subiterations or "stages" m, and the current Runge-Kutta subiteration in parentheses:

$$\begin{aligned}\mathbf{U}_i^{(0)} &= \mathbf{U}_i^n\\ \mathbf{U}_i^{(1)} &= \mathbf{U}_i^{(0)} - \alpha_1\frac{\Delta t_i}{\Omega_i}\mathbf{R}_i^{(0)}\\ &\;\;\vdots\\ \mathbf{U}_i^{(m)} &= \mathbf{U}_i^{(0)} - \alpha_m\frac{\Delta t_i}{\Omega_i}\mathbf{R}_i^{(m-1)}\\ \mathbf{U}_i^{n+1} &= \mathbf{U}_i^{(m)}\end{aligned} \qquad (43)$$

Where

$$\alpha_k = \frac{1}{m - k + 1} \qquad (44)$$
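A minimal sketch of this update loop is shown below; computeResidual is a hypothetical placeholder for the residual evaluation of Eq. (14) (including any smoothing), and dt, Omega, and U are assumed to be nCells-by-1 and nCells-by-5 arrays.

```matlab
% Sketch of the m-stage Runge-Kutta update of Eqs. (43)-(44) for all cells.
m  = 3;                                     % number of stages
U0 = U;                                     % U^(0) = U^n
for stage = 1:m
    alpha = 1 / (m - stage + 1);            % Eq. (44)
    R = computeResidual(U);                 % placeholder for Eq. (14), nCells-by-5
    U = U0 - alpha * bsxfun(@times, dt ./ Omega, R);
end                                         % U now holds U^(n+1)
```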

2. Implicit Residual Smoothing

An implicit residual smoothing method is employed as by Frink4 in order to allow for an increase in the maximum allowable time step for explicit integration. The concept of the method is that residuals are filtered through a Laplacian operator to smooth out discontinuities and enhance stability:

$$\bar{\mathbf{R}}_i = \mathbf{R}_i + \epsilon\nabla^2\bar{\mathbf{R}}_i \qquad (45)$$

The Laplacian operator is approximated by a sum of the differences between the residuals of adjacent cells I and the cell itself i:

$$\nabla^2\bar{\mathbf{R}}_i = \sum_{I}\left(\bar{\mathbf{R}}_I - \bar{\mathbf{R}}_i\right) \qquad (46)$$

These equations are combined and solved through a Jacobi iteration, where m represents the steps of this iteration:

$$\bar{\mathbf{R}}_i^{(m)} = \left.\left[\mathbf{R}_i + \epsilon\sum_{I}\bar{\mathbf{R}}_I^{(m-1)}\right]\right/\left[1 + \epsilon\sum_{I}1\right] \qquad (47)$$

The constant ε is set to control the diagonal dominance of the system. Blazek5 suggests that values between 0.5 and 0.8 are most useful. Notice that when run in a vectorized manner, this method will require two nested loops: an outer loop over the different types of cell topologies (e.g. triangles and quadrilaterals) and an inner loop over the number of neighbors in that particular cell type. This makes it a well-vectorized method that should not interfere substantially with the fantastic vectorization of the explicit method.

Most sources suggest that two Jacobi iterations are sufficient, but in practice here it has been found that three or four can be useful in some cases. Compared with the computational expense of calculating gradients, limiter functions, or even the Roe differencing, additional iterations of the smoothing operation are extremely fast, due to their good vectorization and low operation count.
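A sketch of the Jacobi iteration of Eq. (47) is given below for interior triangular cells, each of which has exactly three neighbours; adj is a hypothetical NT-by-3 table of neighbouring cell indices, and boundary cells would need their neighbour count adjusted.

```matlab
% Sketch of Eq. (47): two Jacobi sweeps of implicit residual smoothing.
% R:   NT-by-5 unsmoothed residuals; adj: NT-by-3 neighbour cell indices.
eps_s   = 0.5;                               % smoothing coefficient
Rsmooth = R;                                 % initial guess: unsmoothed residuals
for it = 1:2                                 % two Jacobi iterations (see text)
    Rprev = Rsmooth;
    for m = 1:5
        Rm = Rprev(:, m);
        Rsmooth(:, m) = (R(:, m) + eps_s * sum(Rm(adj), 2)) / (1 + 3*eps_s);
    end
end
```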


G. Implicit Time Integration Schemes

Implicit methods were created for this solver, but they appear to have some issues that prevent them from being fully functional. This is discussed later, but the implementation of the methods is given here regardless. Implicit methods are formulated by applying a backward-Euler time discretization to Eq. (14):

$$\frac{\Omega_i}{\Delta t}\Delta\mathbf{U}_i^n = -\beta\mathbf{R}_i^{n+1} - \left(1 - \beta\right)\mathbf{R}_i^n \qquad (48)$$

Where ∆Un is the delta to the state vector of the next iteration:

$$\mathbf{U}^{n+1} = \mathbf{U}^n + \Delta\mathbf{U}^n \qquad (49)$$

The residual vector at time step n+1 is estimated using a Taylor series expansion from time step n:

$$\mathbf{R}_i^{n+1} \approx \mathbf{R}_i^n + \frac{\partial\mathbf{R}_i}{\partial\mathbf{U}_i}\Delta\mathbf{U}_i^n \qquad (50)$$

Combining Equations (48) and (50) yields the following implicit solution formulation:

$$\left[\frac{\Omega_i}{\Delta t} + \beta\frac{\partial\mathbf{R}_i}{\partial\mathbf{U}_i}\right]\Delta\mathbf{U}_i^n = \mathbf{J}\,\Delta\mathbf{U}_i^n = -\mathbf{R}_i^n \qquad (51)$$

Where the quantity in brackets on the left is called the implicit operator J. The constant β is set to 1 for typical implicit methods. In general, the "explicit operator" Rn is still computed using a flux-splitting or flux-differencing scheme (such as Roe's) but a simpler scheme is sometimes used to split fluxes within the implicit operator. In general, the derivative in the implicit operator is divided into components for each face of the cell:

$$\frac{\partial\mathbf{R}_i}{\partial\mathbf{U}} = \sum_{j=\kappa(i)}^{N_F}\left[\frac{\partial\mathbf{F}_{c,j}}{\partial\mathbf{U}} - \frac{\partial\mathbf{F}_{v,j}}{\partial\mathbf{U}}\right]S_j \qquad (52)$$

Roe's Flux Difference Splitting can be applied here as well to expand the flux Jacobians. By assuming locally constant Roe matrices, a reasonably accurate approximation to the product of the flux Jacobian and the state vector step is obtained:

$$\frac{\partial\mathbf{R}_c}{\partial\mathbf{U}}\Delta\mathbf{U}^n \approx \sum_{m=1}^{N_F}\frac{S_m}{2}\left[\mathbf{A}_{c,L,m}\Delta\mathbf{U}_{L,m}^n + \mathbf{A}_{c,R,m}\Delta\mathbf{U}_{R,m}^n - \left|\mathbf{A}_{Roe,m}\right|\left(\Delta\mathbf{U}_{R,m}^n - \Delta\mathbf{U}_{L,m}^n\right)\right] \qquad (53)$$

Where AL is computed from the left-hand cell, AR is computed from the right-hand cell, and ARoe is computed from the Roe-averaged values between the two cells, all at iteration n-1. If it is assumed that the "current" cell is always the left-handed one, the following implicit scheme is obtained:

$$\left[\frac{\Omega_i}{\Delta t} + \sum_{j=\kappa(i)}^{N_F}\frac{S_j}{2}\left(\mathbf{A}_{c,L} + \left|\mathbf{A}_{Roe}\right|\right)_j\right]\Delta\mathbf{U}_i + \sum_{j=\kappa(i)}^{N_F}\frac{S_j}{2}\left(\mathbf{A}_{c,R} - \left|\mathbf{A}_{Roe}\right|\right)_j\Delta\mathbf{U}_j = -\mathbf{R}_i^n \qquad (54)$$

Notice that the coefficient of the first term (in brackets) will form the diagonal of the implicit matrix equation. The individual components of the second term (without the state vector) are the off-diagonal terms. A sparse linear system is created:


$$\mathbf{M}\,\Delta\mathbf{U} = -\mathbf{R}_i^n \qquad (55)$$

Note that the term involving ∆t is added to the diagonal of the implicit matrix. This serves to stabilize the system and decrease the condition number of the matrix. A small time step will significantly increase the diagonal dominance of the matrix. This time step is still calculated via a CFL number as it is in the explicit case, but the limit on the maximum useful CFL number is set more by how singular a matrix (how large a condition number) is acceptable than by the inherent stability of the numerical method, as is the case with the explicit solution.

1. Flux Jacobian Matrices

Direct computation of the flux Jacobian is sometimes (but not always) necessary in evaluating the derivative of the residual with respect to the state vector. For these instances, the formulation of the convective flux Jacobian is given here:

$$\mathbf{A}_c = \frac{\partial\mathbf{F}_c}{\partial\mathbf{U}} = \begin{bmatrix}0 & n_x & n_y & n_z & 0\\ n_x\phi_1 - uV & V - a_3 n_x u & n_y u - a_2 n_x v & n_z u - a_2 n_x w & a_2 n_x\\ n_y\phi_1 - vV & n_x v - a_2 n_y u & V - a_3 n_y v & n_z v - a_2 n_y w & a_2 n_y\\ n_z\phi_1 - wV & n_x w - a_2 n_z u & n_y w - a_2 n_z v & V - a_3 n_z w & a_2 n_z\\ V\left(\phi_1 - \tilde{a}\right) & n_x\tilde{a} - a_2 uV & n_y\tilde{a} - a_2 vV & n_z\tilde{a} - a_2 wV & \gamma V\end{bmatrix} \qquad (56)$$

Where:

$$V = u\,n_x + v\,n_y + w\,n_z, \quad \phi_1 = \tfrac{1}{2}\left(\gamma - 1\right)\left(u^2 + v^2 + w^2\right), \quad \tilde{a} = \frac{\gamma e_0}{\rho} - \phi_1, \quad a_2 = \gamma - 1, \quad a_3 = \gamma - 2 \qquad (57)$$

Note that if all of the terms seen above are expanded, the flux Jacobian is a function only of the state vector and the normal vector and it is an odd function with respect to the normal vector. That is, Ac(U,-n) = -Ac(U,n). This means that, just like the flux vectors, the flux Jacobian need only be computed once for each face, provided that it is based on symmetric state vector information. So, ARoe only has to be computed once for each face. AL and AR must be calculated separately, but for the adjacent cell, they can be swapped and their signs changed.

2. Implicit Boundary Conditions

Boundary conditions are applied in the implicit equation solely in the diagonal terms of the implicit matrix. This is possible because the boundary conditions here only depend on the cell at the boundary (and potentially some constant information such as freestream conditions). This is done with an additional flux Jacobian matrix that is inserted into the equation as another AL that has no corresponding AR or ARoe.

For an inviscid wall, a special flux Jacobian matrix can be used that is derived from the inviscid wall flux (Eq. 34):

$$\mathbf{A}_{c,wall} = \begin{bmatrix}0 & 0 & 0 & 0 & 0\\ n_x\phi_1 & -a_2 u n_x & -a_2 v n_x & -a_2 w n_x & a_2 n_x\\ n_y\phi_1 & -a_2 u n_y & -a_2 v n_y & -a_2 w n_y & a_2 n_y\\ n_z\phi_1 & -a_2 u n_z & -a_2 v n_z & -a_2 w n_z & a_2 n_z\\ 0 & 0 & 0 & 0 & 0\end{bmatrix} \qquad (58)$$

The constants here have the same values as described in Eq. 57. For far-field boundaries, the standard form of the flux Jacobian from Eq. 56 is used, but plugging in the

boundary values instead. For supersonic inflow, the flux Jacobian matrix can be substituted with the null matrix as


there is no dependence on the internal cell state vector. For supersonic outflow, the Jacobian matrix from inside the cell can be used (AL as usual). For subsonic inflow and outflow, Equations 41 and 42 are used to calculate the values of the primitive state vector on the face and then Eq. 56 is used to compute the flux Jacobian.

3. Gauss-Seidel Scheme

The Gauss-Seidel scheme is an iterative method to solve matrix equations. The scheme factors the implicit operator into three parts: a strictly upper-triangular portion U, a strictly lower-triangular portion L, and a diagonal portion D:

$$\mathbf{M}\,\Delta\mathbf{U}_i^n = \left(\mathbf{D} + \mathbf{L} + \mathbf{U}\right)\Delta\mathbf{U}_i^n = -\mathbf{R}_i^n \qquad (59)$$

This equation can be solved with an iterative step:

$$\left(\mathbf{D} + \mathbf{L}\right)\Delta\mathbf{U}_i^{(m)} = -\mathbf{R}_i^n - \mathbf{U}\,\Delta\mathbf{U}_i^{(m-1)} \qquad (60)$$

This method is designed to run for several subiterations during each iteration of the flow solver. In the method used presently, the first value of ∆U(0) is assumed to be all zeros. In subsequent subiterations, the value found in the previous iteration is used. After a specified number of subiterations, the value of ∆Um is used as ∆Un to move onto the next solver iteration.

These equations are solved directly with sparse matrix methods. In many CFD applications, only one step is taken in iterative methods before reevaluating the implicit matrix. It has been found, however, that taking a small number of steps (two or three) can increase the stability of the method. This is generally not too crippling to the runtime of a complete iteration, as the most time-consuming part of the operation is the assembly of the implicit matrix. Solving these sparse triangular systems is actually a very fast operation.
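A minimal sketch of one such subiteration with Matlab's built-in sparse matrices is shown below; M, Rn, and dU are hypothetical names for the assembled implicit operator, the stacked residual vector, and the current subiterate, and the D/L/U split here uses the scalar diagonal.

```matlab
% Sketch of one Gauss-Seidel subiteration, Eq. (60), with sparse matrices.
n  = size(M, 1);
D  = spdiags(spdiags(M, 0), 0, n, n);   % diagonal part of M
Lp = tril(M, -1);                       % strictly lower-triangular part
Up = triu(M,  1);                       % strictly upper-triangular part

dU = (D + Lp) \ (-Rn - Up * dU);        % sparse triangular solve for Delta U^(m)
```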

4. Successive Over-Relaxation (SOR) Scheme

Successive Over-Relaxation (SOR) is an alternative iterative method to solve linear systems. Like the Gauss-Seidel method, the matrix is factored into its U, L, and D components. In this method, however, a relaxation factor ω appears in the formulation of the method:

$$\left(\mathbf{D} + \omega\mathbf{L}\right)\Delta\mathbf{U}_i^{(m)} = -\omega\mathbf{R}_i^n - \left[\omega\mathbf{U} + \left(\omega - 1\right)\mathbf{D}\right]\Delta\mathbf{U}_i^{(m-1)} \qquad (61)$$

Other than the slight alterations to the operative equation and the addition of the parameter ω, the method operates the same as the Gauss-Seidel.

H. Time Step Calculation

Local time-stepping is employed in the implicit and explicit schemes to increase convergence speed. Time steps for each cell are calculated from equations given by Blazek5:

$$\Delta t_i = \frac{\sigma\,\Omega_i}{\Lambda_i^x + \Lambda_i^y + \Lambda_i^z} \qquad (62)$$

Where σ is the CFL number and the Λ variables represent the spectral radii of the flux Jacobian in the x, y, and z directions:

$$\Lambda_i^x = \left(\left|u_i\right| + a_i\right)\Delta\hat{S}^x, \quad \Lambda_i^y = \left(\left|v_i\right| + a_i\right)\Delta\hat{S}^y, \quad \Lambda_i^z = \left(\left|w_i\right| + a_i\right)\Delta\hat{S}^z \qquad (63)$$

And the ΔŜ variables represent the projected area of the cell in the y-z, x-z, and x-y planes:

$$\Delta\hat{S}^x = \frac{1}{2}\sum_{j=\kappa(i)}\left|n_{x,j}\right|S_j, \quad \Delta\hat{S}^y = \frac{1}{2}\sum_{j=\kappa(i)}\left|n_{y,j}\right|S_j, \quad \Delta\hat{S}^z = \frac{1}{2}\sum_{j=\kappa(i)}\left|n_{z,j}\right|S_j \qquad (64)$$


The maximum allowable CFL number for convergence depends heavily on which of the various methods presented here are in use and upon several other options, such as the number of Runge-Kutta subiterations or implicit residual smoothing iterations.
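A sketch of Eqs. (62)-(63) in vectorized form is given below; the projected areas of Eq. (64) depend only on the grid and are assumed to have been precomputed into nCells-by-1 arrays Sx, Sy, Sz (hypothetical names, as are u, v, w, a, and Omega).

```matlab
% Sketch of the local time step, Eqs. (62)-(63), for every cell at once.
CFL  = 3.5;                                % Courant number sigma
LamX = (abs(u) + a) .* Sx;                 % spectral radii, Eq. (63)
LamY = (abs(v) + a) .* Sy;
LamZ = (abs(w) + a) .* Sz;
dt   = CFL * Omega ./ (LamX + LamY + LamZ);   % local time step, Eq. (62)
```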

I. Nondimensionalization of Variables

All of the flow variables are nondimensionalized in an effort to aid computational stability. This nondimensionalization strategy is based upon that described by Tannehill et al8. All variables seen elsewhere in this document are replaced by nondimensional variables, denoted here (but nowhere else) with an asterisk:

$$\begin{aligned}x^* &= \frac{x}{L}, & y^* &= \frac{y}{L}, & z^* &= \frac{z}{L}, & t^* &= \frac{t V_\infty}{L},\\ u^* &= \frac{u}{V_\infty}, & v^* &= \frac{v}{V_\infty}, & w^* &= \frac{w}{V_\infty}, & p^* &= \frac{p}{\rho_\infty V_\infty^2},\\ \rho^* &= \frac{\rho}{\rho_\infty}, & T^* &= \frac{T}{T_\infty}, & \mu^* &= \frac{\mu}{\mu_\infty}\end{aligned} \qquad (65)$$

In the event that there is a zero freestream velocity, the freestream speed of sound may be used in its place. Note that this nondimensionalization causes an interesting redefinition of a few common constants:

$$R^* = \frac{1}{\gamma M_\infty^2}, \qquad \mu_0^* = 1, \qquad T_0^* = \frac{T_0}{T_\infty}, \qquad C^* = \frac{C}{T_\infty} \qquad (66)$$

Where R is the gas constant and C, T0 and µ0 are the coefficients in Sutherland's formula. With these constants redefined, the ideal gas law and Sutherland's formula will continue to behave as expected to compute temperature and dynamic viscosity:

$$p^* = \rho^* R^* T^* \qquad (67)$$

$$\mu^* = \mu_0^*\left(\frac{T^*}{T_0^*}\right)^{3/2}\frac{T_0^* + C^*}{T^* + C^*} \qquad (68)$$

The only other consequence of the nondimensionalization within the code is the factor of 1/Re that shows up in the viscous flux terms.

V. Success of Solver and Quality of Results

A. Explicit Euler Results

When run in explicit Euler mode, the solver created here produces reasonably accurate results for subsonic and transonic cases. Figures 1 and 2 present pressure coefficient distributions for a NACA 0012 airfoil with several different flux limiting options alongside wind tunnel data9. All of these cases were run with a CFL number of 3.5 for 5000 iterations, with each iteration containing three Runge-Kutta stages. Three iterations were used in the implicit residual smoothing method with a coefficient of 0.5. Figure 1 presents a case at Mach 0.3 and 4.04º angle-of-attack. Figure 2 presents a case at Mach 0.703 and 4.03º angle-of-attack. The different flux limiter options shown include: fully 1st-order (Ψi = 0), fully 2nd-order (Ψi = 1), Venkat's limiter, and Barth and Jespersen's limiter.


The lower Mach number case seems fairly reasonable everywhere. The primary difference between the 1st- and 2nd-order cases is in the value of Cp at the suction peak. The 1st-order method over-predicts slightly (a "peakier" distribution) while the 2nd-order method under-predicts slightly (a "rounded" distribution). Venkat's limiter actually does a very nice job of splitting the difference and coming up with nearly the right suction peak pressure (it over-predicts the value by approximately 3%). The Barth and Jespersen limiter does not substantially improve upon the 2nd-order case.

The transonic case is a slightly different story. The 1st- and 2nd-order cases differ drastically in their placement of the shock and the sharpness of the shock. The 1st-order case produces a very crisp, clean shock, but it is misplaced by 5-7% chord. The 2nd-order case

"smears out" the shock to a considerable extent, but it does seem to begin in approximately the right location. These behaviors are fairly typical of transonic Euler CFD solutions. Unfortunately, the flux limiters do not seem to appreciably help the problem. They are apparently both very "aggressive" (i.e. they tend more towards 2nd-order than 1st-order). Venkat's limiter does result in a strengthening of the shock in comparison to the 2nd-order case, but it still under-predicts the strength. Looking at the 1st- and 2nd-order cases, it would appear that there should be some compromise between them that would be a very good solution, but neither limiter finds it. It is worthwhile to note that the difficulty in placing the shock properly is likely largely grid-related, as no adaptation is performed to capture the shock location.

Figures 3 and 4 show colored Mach number contours around two of these cases. Both are fully 1st-order solutions of the NACA0012 at the two flight conditions described above.

Figure 1 - Cp Distribution for NACA0012 at Mach 0.3.

Figure 2 - Cp Distribution for NACA0012 at Mach 0.7.

Figure 4 - Mach Contours - M∞=0.3.

Figure 3 - Mach Contours - M∞=0.7.


B. More Complex Modes of Operation

The viscous and implicit modes of the solver unfortunately still contain enough errors or bugs that they do not produce good results. The viscous mode seems to run well and produce reasonably correct-looking results, but at some point along its operation, it has a tendency towards abrupt and massive divergence. Strange oscillations appear in the temperature boundary layer just before these divergences, but it is currently unknown whether these are the cause or the result of the divergence.

The implicit mode of the solver has its own odd set of issues. The implicit matrix created is much too close to singular for accurate solution of the system, especially at high CFL numbers. At a CFL number of 1, Matlab estimates the condition number of the matrix to be in the vicinity of 10^6 to 10^7. This is near the upper limit of ill-conditioning for which Matlab can solve a system directly, and the results of such a solution should be regarded as highly questionable. The addition of a simple preconditioner (such as a Jacobi preconditioner) or the use of an iterative method (such as a Gauss-Seidel or Successive Over-Relaxation method) shrinks the condition number of the implicit matrix somewhat and allows CFL numbers of around 2, but nowhere near the enormous CFL numbers typical of successful implicit methods.
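
As a rough illustration of the check and the remedy described above, the following Matlab sketch estimates the conditioning of the assembled system and applies a simple Jacobi (diagonal) preconditioner before an iterative solve. The names A and b are placeholders for the implicit matrix and right-hand side, not the solver's actual identifiers.

    % Minimal sketch, assuming the sparse implicit matrix A and right-hand side b
    % have already been assembled (names are illustrative only).
    kappa = condest(A);                    % 1-norm condition number estimate
    fprintf('Estimated condition number: %g\n', kappa);

    % Jacobi (diagonal) preconditioner.
    n = size(A, 1);
    M = spdiags(diag(A), 0, n, n);

    % Preconditioned iterative solve; GMRES is one of Matlab's built-in options.
    tol = 1e-8; maxit = 200;
    [dU, flag] = gmres(A, b, [], tol, maxit, M);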

It is unknown what is causing these condition numbers to be so large, but it does not seem to be affecting the solution quality overly much. The solutions produced by the implicit and explicit modes of the Euler solver seem to be very comparable. However, since the explicit solver is actually more stable and requires slightly less computation time per iteration, there is currently no real point in utilizing the implicit solver.

While the viscous and implicit modes are not as functional as they ought to be, they still faithfully replicate the computations necessary to run these types of methods. Consequently, these modes of the solver can still be used to evaluate runtimes and to compare CPU and GPU computations. They necessitate small CFL numbers and relatively few iterations, but while 500 iterations at a CFL number of 0.5 may not produce a converged solution, the runtime of such a run should still be representative of the time per iteration required for a fully converged solution.

VI. Timing Results

A. Hardware Used

The following hardware was used in all tests shown here. The release dates and prices of hardware are given in order to inform comparisons.

• Intel Core i7 920 Processor: Released Nov. 2008 at $285

• 6 GB of DDR3 RAM: Purchased Mar. 2009 at $100

• NVIDIA GeForce GTX480: Released Mar. 2010 at $500 (1536 MB of on-board memory)

While the graphics card used was considerably more expensive than the CPU, at least some of the cost of the RAM should be added to the cost of the CPU. Building a large computing cluster with many CPUs requires the purchase of a large amount of RAM, whereas the graphics card includes its own. The GPU is also a newer architecture than the CPU. This temporal jump was deliberately allowed because of the large increase in scientific computing efficiency from the 2009-vintage GeForce 200 series to the 2010-vintage GeForce 400 series. (Prior to the 400 series, the cards were not specifically designed to perform double-precision calculations and thus their performance when doing so was substandard.) The upgrade in CPU performance over the same year was more modest.

Overall, based simply on the release dates and price difference, it is probably fair to expect the GPU to be perhaps twice as fast as the CPU. This means that for a difference in runtime to be significant in terms of performance per price, the speedup should be somewhat greater than 2x.

B. Explicit CPU vs. GPU Timing Results

Figure 5 shows the runtimes (in seconds per iteration) of solutions of a NACA0012 airfoil using both the CPU and GPU methods. In all cases, the solver was running fully 2nd-order fluxes with no limiter, only a single Runge-Kutta subiteration, and two implicit residual smoothing steps.

Figure 5 - CPU and GPU Runtimes vs. Cell Count.

These timing results are plotted versus the number of cells in the grid, ranging from 20,000 to 600,000 cells. Figure 6 shows the same data, but interpreted as a "speedup" multiplier (the CPU time per iteration divided by the GPU time per iteration). The GPU execution ranges from slightly slower than CPU execution at the minimum grid size to approximately six times faster at the maximum grid size.

Clearly the trends indicate that GPU computation becomes more and more advantageous as the grid size increases. There are a few explanations for this phenomenon. At very small grid sizes, some of the operations performed by the solver will not even occupy all 480 processor cores on the card; this effect is probably minimal in the timing runs shown here, all of which have many more cells than that. More important is likely the ratio of time spent interpreting instructions to time spent executing them. The GPU is built to run repetitive, predictable operations on massive data sets, so the larger the data set, the more it outperforms the CPU. On smaller data sets, the instruction stream changes too quickly for this throughput advantage to be fully exploited. The result is that the speedup numbers follow a "diminishing returns" curve right up to the point where the video card runs out of memory. The case with 600,000 cells is presented here, but a case with 700,000 cells would not run because of memory limitations.

It is worthwhile to mention that in virtually all scientific computing applications, there are tradeoffs to be made between memory usage efficiency and computational efficiency. In this project, the decision was made to always favor computational efficiency. This means that virtually any piece of data that could be computed only once and stored was treated that way, from grid connectivity matrices to face normal vectors and cell areas. This practice is beneficial to the runtime of the code, but imposes additional limitations on the maximum grid size that can be used on a relatively memory-limited device such as the GTX 480. (Note that purpose-built scientific computing cards such as NVIDIA's "Tesla" series can be purchased with much more available memory).
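
As a small illustration of this compute-once-and-store approach, the 2-D face geometry can be built from the node coordinates a single time and then reused every iteration. The sketch below assumes illustrative variable names (x for node coordinates and faceNodes for the face-to-node connectivity), not necessarily those used in the solver.

    % Minimal 2-D sketch, assuming:
    %   x         - (nNodes x 2) node coordinates
    %   faceNodes - (nFaces x 2) node indices defining each face (edge)
    d = x(faceNodes(:,2),:) - x(faceNodes(:,1),:);   % edge vectors
    S = sqrt(sum(d.^2, 2));                          % face "areas" (edge lengths)
    n = [d(:,2)./S, -d(:,1)./S];                     % unit normals (edge vectors rotated and normalized)
    % S and n are stored once and reused every iteration rather than recomputed.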

C. Runtimes of Implicit Modes

Unfortunately, the implicit modes of operation could not be tested on the GPU system because sparse matrix support in Jacket is still in its infancy and does not include all of the functionality used by the implicit solver. However, some results are presented here in an attempt to show the potential performance of the GPU on implicit operations.

Figure 7 compares the runtime per iteration of three different time integration schemes: the three-stage Runge-Kutta, a three-step Gauss-Seidel, and a three-step Successive Over-Relaxation. Both implicit methods use Matlab's sparse matrix functionality to build the implicit matrix and perform the matrix operations needed to solve the system. The Gauss-Seidel is consistently about 1.8 times slower than the explicit method and the SOR is about 2.7 times slower. These ratios are relatively constant with increasing cell count, up to the largest tested value of just under 400,000 cells. If these implicit methods converged properly and permitted the much larger CFL numbers typical of implicit methods, their total solution times would be far lower than that of the explicit method.
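
For reference, one relaxation step of this kind is very compact with Matlab's sparse matrices. The sketch below shows a generic three-sweep SOR update (omega = 1 recovers Gauss-Seidel) applied to a system A*dU = R, assuming the implicit matrix A and residual R have already been assembled as a sparse matrix and a column vector; the names are illustrative rather than the solver's own.

    % Hedged sketch of three SOR sweeps on the sparse system A*dU = R.
    omega = 1.5;                          % omega = 1 gives Gauss-Seidel
    nd = size(A, 1);
    D  = spdiags(diag(A), 0, nd, nd);     % diagonal part of A
    Lo = tril(A, -1);                     % strictly lower-triangular part
    M  = D/omega + Lo;                    % SOR splitting matrix (lower triangular)
    dU = zeros(nd, 1);
    for sweep = 1:3
        dU = dU + M \ (R - A*dU);         % backslash performs a sparse forward substitution
    end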

D. Lessons Learned About GPU Operations

Aside from the numerical details available in runtimes, a number of trends have been observed over time. Some of these are presented here.

Figure 6 - CPU-to-GPU Speedup vs. Cell Count.

Figure 7 - Comparison of Runtimes of Different Time Integration Schemes.

The GPU can be extraordinarily slow at many searching- and sorting-type operations. This tendency has shown up most clearly in the grid preprocessing routines, since the code that generates the grid connectivity matrices involves a great deal of sorting and searching. When this code was converted to run on the GPU, its runtime increased by as much as two orders of magnitude in some cases. Once this was realized, these preprocessing operations were conducted solely on the CPU. For this scale of operations (two-dimensional grids run on a single CPU or GPU) this is not too crippling. In a massively-parallel setting running three-dimensional grids with millions of cells, however, it could be a problem: a set of GPUs running in parallel in a cluster may have trouble generating these connectivity matrices without the aid of a set of CPUs that would otherwise sit nearly idle during the solution process. On the other hand, these connectivity matrices only have to be created once for a given grid, and multiple cases (different Mach numbers, Reynolds numbers, and angles-of-attack) could be run in parallel on multiple sets of GPUs from connectivity matrices generated on a single set of CPUs.
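
As an example of the kind of sort/search-heavy preprocessing that is best left on the CPU, the sketch below builds a unique face (edge) list from a triangle cell-to-node array. The name tri is assumed purely for illustration; the solver's actual preprocessing handles mixed elements and builds the full set of connectivity functions described in Section III.B.

    % Hedged sketch: extracting unique faces (edges) from triangle connectivity.
    % tri is an (nTri x 3) cell-to-node index array (name assumed).
    edges = [tri(:,[1 2]); tri(:,[2 3]); tri(:,[3 1])];      % every edge of every cell
    edges = sort(edges, 2);                                   % canonical node ordering per edge
    [faceNodes, ~, cellEdgeToFace] = unique(edges, 'rows');   % unique faces plus a map back
    % faceNodes plays the role of a face-to-node list; cellEdgeToFace allows the
    % cell-to-face connectivity to be assembled by reshaping.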

Indexing operations are not nearly as crippling as might initially be suspected. When this project was started, it was assumed that the mathematical operations run on the GPU would see massive speedups while the indexing operations (for example, gathering all of the face flux values that must be summed to give the net residual of a cell) would likely be slower than on the CPU. Instead, it was found that the indexing operations saw nearly the same speedups as the mathematical ones. This is difficult to present as hard numerical data, but it has been the author's consistent experience when sifting through code profiler results comparing CPU and GPU computations. Perhaps this particular observation is overly influenced by the choice of Matlab as a computing environment, but it was a welcome result nonetheless.
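
The face-to-cell gather described above can be expressed as a single vectorized call. The following is a minimal sketch for a simple face-based layout; faceFlux and faceCells are assumed names for illustration, not the solver's identifiers, and the sign convention takes the flux as positive out of the first ("left") cell of each face.

    % Hedged sketch: accumulate face fluxes into cell residuals with accumarray.
    %   faceFlux  - (nFaces x nVars) flux of each conserved quantity through each face
    %   faceCells - (nFaces x 2) indices of the cells on either side of each face
    nVars  = size(faceFlux, 2);
    nCells = max(faceCells(:));
    R = zeros(nCells, nVars);
    for m = 1:nVars
        R(:,m) = accumarray(faceCells(:,1), faceFlux(:,m), [nCells 1]) ...
               - accumarray(faceCells(:,2), faceFlux(:,m), [nCells 1]);
    end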

A few even stranger tendencies have been noticed in GPU operation. One of these is the fact that division operations on the GPU are considerably slower than multiplication operations. This is generally true about CPU computing as well (and is frequently mentioned in computer science references concerned with algorithm efficiency), but has largely been overcome by recent processor and compiler design. In particular, attempting to divide by a constant will virtually always be changed during compilation into a multiplication operation. The author's anecdotal experience from within Matlab is that division operations on the CPU might require on the order of 1.25 times as much time as multiplication operations. On the GPU, this factor might go up to as much as four or five times as long. The GPU operation may still be faster than the CPU operation, but the programmer is well-served by attempting to reduce the number of division operations in their code.
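
A trivial but representative example of this kind of rewrite, again with assumed variable names, is to divide by the cell volumes once and reuse the reciprocals inside the iteration loop.

    % Hedged sketch: precompute reciprocals so the per-iteration update only multiplies.
    %   vol - (nCells x 1) cell volumes, R - (nCells x nVars) residuals (names assumed)
    invVol = 1 ./ vol;                        % one division per cell, performed once
    dUdt   = bsxfun(@times, R, invVol);       % element-wise multiply each iteration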

Probably the most unexpected thing discovered while running timing cases was the impact of the computer's other activities on the processing ability of the GPU. In particular, whenever the computer's screen saver would start up, the time required per iteration on the GPU would increase dramatically, while the CPU runtimes would be virtually unaffected. Then, when the computer's power saving settings would turn off the monitors altogether, the runtimes would drop considerably again, to even lower than they were originally. None of this is particularly shocking, but it did create some extremely odd timing plots before it was understood. The timing results here were computed with the computer set to deactivate its displays completely whenever the user is inactive for even a short time and the computer's mouse was unplugged for the duration of the runs to prevent an accidental "waking."

VII. Conclusions and Future Work

GPU-based computing has been shown to have great promise, even within the realm of unstructured-grid codes, where complex indexing operations are far more prevalent than in structured codes. This is an extremely encouraging result: the usefulness of GPUs in CFD would be severely diminished if one had to revert to the much more tedious grid generation process of structured codes in order to obtain the runtime efficiency of the GPU.

The implicit mode of operation needs considerable additional work; it is currently barely functional and not yet genuinely useful. However, the author sees tremendous promise in the implicit mode. Sparse matrix operations in Matlab are extremely efficient and there are many built-in methods for iteratively solving sparse matrix systems. If CFL numbers of reasonable size could be obtained with the implicit mode, this host of solution methods could be brought to bear to generate extremely rapid solutions.

One of the largest steps that can be taken to extend this work is the extension of the code to 3-D solutions. All of the methodology is set up to be quickly and easily extensible to 3-D operation, but this has not yet been attempted. The greatest speedups from CPU to GPU were seen on large grids with hundreds of thousands of cells. Such grids are far finer than necessary for a simple 2-D geometry like the NACA0012, but these cell counts are right in line with what is needed to obtain Euler results on three-dimensional geometries of mild complexity.

Figure 8 - Simple Example Grid.

Appendix A. Example Grid Connectivity Functions

Figure 8 shows a simple mixed-element grid consisting of four triangular cells and two quadrilateral cells. The large black numbers indicate cell indices (i), blue numbers with arrows indicate face indices (j), and red numbers indicate node indices (k). The following matrices are the correct connectivity functions for this grid, as described in Section III.B above. (Note that the NaNs indicate placeholder values, not errors of any kind.)

[Connectivity matrices for the example grid of Figure 8, concluding with the cell-to-node function ξ(i).] (69)
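
To illustrate the NaN-padding convention only (the values below are hypothetical and are not the matrices of Figure 8), a cell-to-face connectivity array for a mixed grid of two triangles and one quadrilateral might be stored as follows.

    % Hypothetical example of NaN padding in a connectivity matrix (not Figure 8's data).
    % Each row lists the face indices of one cell; triangle rows carry a NaN placeholder.
    kappaOfI = [ 1  2  3  NaN ;    % triangular cell: three faces plus one placeholder
                 3  4  5  NaN ;    % triangular cell
                 5  6  7  8   ];   % quadrilateral cell: four faces
    facesOfCell2 = kappaOfI(2, ~isnan(kappaOfI(2,:)));   % NaN entries are simply skipped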

References

1. "Parallel Computing Toolbox - MATLAB," MathWorks Web Site [Online], URL: http://www.mathworks.com/products/parallel-computing/ [cited 18 June 2011].
2. "AccelerEyes - MATLAB GPU Computing," [Online], URL: http://www.accelereyes.com/ [cited 18 June 2011].
3. "GPUmat: GPU toolbox for MATLAB," [Online], URL: http://gp-you.org/ [cited 18 June 2011].
4. Frink, N. T., "Upwind Scheme for Solving the Euler Equations on Unstructured Tetrahedral Meshes," AIAA Journal, Vol. 30, No. 1, 1991.
5. Blazek, J., Computational Fluid Dynamics: Principles and Applications, 2nd ed., Elsevier, New York, 2005.
6. Roe, P. L., "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," Journal of Computational Physics, Vol. 43, pp. 357-372, 1981.
7. Frink, N. T., and Pirzadeh, S. Z., "Tetrahedral Finite-Volume Solutions to the Navier-Stokes Equations on Complex Configurations," NASA TM-1998-208961, 1998.
8. Tannehill, J. C., Anderson, D. A., and Pletcher, R. H., Computational Fluid Mechanics and Heat Transfer, 2nd ed., Taylor & Francis, Philadelphia, PA, 1997.
9. NATO Advisory Group for Aerospace Research and Development, "Experimental Data Base for Computer Program Assessment," AGARD AR-138, 1979.