multi-gpu, implicit time stepping for high-order methods on unstructured...

33
Multi-GPU, Implicit Time Stepping for High-order Methods on Unstructured Grids Jerry Watkins * , Joshua Romero * , and Antony Jameson Department of Aeronautics and Astronautics, Stanford University, Stanford, CA, 94305 In this paper, the development, implementation and performance of a multi-GPU, im- plicit, high-order compressible flow solver for unstructured grids is discussed. The solver utilizes the direct Flux Reconstruction (DFR) method and a multicolored Gauss-Seidel (MCGS) method to converge the steady state Euler equations in a multi-GPU environ- ment. The MCGS scheme is able to obtain a fast, grid converged lift coefficient of 0.1795 for the NACA 0012 airfoil at a 1.25 degree angle of attack, Mach 0.5. The results are obtained with fewer degrees of freedom when compared to Overflow and CFL3D. The high arithmetic intensity and the ease of parallelization makes MCGS an ideal choice for mul- tiple GPUs. The memory size of the left-hand side matrices in the implicit method limits the scheme’s use for high polynomial orders on a single GPU but it is shown that the bottleneck in memory usage can be mitigated by using multiple GPUs. The scheme is able to maintain near perfect weak scaling showing that it can be effectively distributed over multiple GPUs to solve large problems without a significant degradation in performance. I. Introduction Significant contributions have been made towards progressing high-order methods in computational fluid dynamics (CFD). High-order methods refer to a branch of numerical algorithms which employ higher than second order spatial discretization. These algorithms can improve the accuracy of a simulation at a reduced computational cost. 1 In the past, these methods have failed to penetrate the computational design pro- cess because they are generally less robust and more complex to implement than commonly used low-order methods. In recent years, these disadvantages have been mitigated and high-order methods are becom- ing increasingly more popular in the study of steady and unsteady, vortex dominated flows over complex geometries. These flows are often more difficult to simulate using low-order methods because of the high computational cost and increased sensitivity to geometry and numerical dissipation. Discontinuous finite element methods have been the focal point of recent efforts in developing a high- order compressible flow solver for unstructured grids. Popular examples include the Discontinuous Galerkin (DG) scheme 2–4 and the Spectral Difference (SD) scheme. 5, 6 Huynh 7 proposed a Flux Reconstruction (FR) approach for tensor-product elements that provides a generalized differential framework for recovering both the collocation based nodal DG scheme as well as a version of the SD scheme. This framework has been successfully extended to triangular 8, 9 and tetrahedral 10 elements as well. Even more general frameworks such as the Correction Procedure via Reconstruction (CPR) 11 have now been proposed that unify the FR and the Lifting Collocation Penalty (LCP) 12 formulations. Recently, the direct Flux Reconstruction (DFR) method has been developed as a simplified formulation of the FR method that reduces the theoretical and implementation complexity of the FR method. 13 The push towards high-order, unsteady flow simulations over complex geometries has sparked a need for faster convergence for large scale problems. 
Accelerated explicit methods and the polynomial multigrid method have been used to accelerate convergence rates but are sometimes not enough to overcome the stiffness found in aerodynamic applications where the cell volume varies by several orders of magnitude between the body and the far field. 14–16 For these class of problems, implicit methods offer an alternate means to converge steady state solutions or drive the solution to physical time steps in dual time stepping * Ph.D. Candidate, Department of Aeronautics and Astronautics, Stanford University, AIAA Student Member Professor, Department of Aeronautics and Astronautics, Stanford University, AIAA Member 1 of 33 American Institute of Aeronautics and Astronautics 46th AIAA Fluid Dynamics Conference 13-17 June 2016, Washington, D.C. AIAA 2016-3965 Copyright © 2016 by Jerry Watkins, Joshua Romero, Antony Jameson. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission. AIAA Aviation

Upload: others

Post on 30-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • Multi-GPU, Implicit Time Stepping for High-order

    Methods on Unstructured Grids

    Jerry Watkins∗, Joshua Romero∗, and Antony Jameson†

    Department of Aeronautics and Astronautics, Stanford University, Stanford, CA, 94305

    In this paper, the development, implementation and performance of a multi-GPU, im-plicit, high-order compressible flow solver for unstructured grids is discussed. The solverutilizes the direct Flux Reconstruction (DFR) method and a multicolored Gauss-Seidel(MCGS) method to converge the steady state Euler equations in a multi-GPU environ-ment. The MCGS scheme is able to obtain a fast, grid converged lift coefficient of 0.1795for the NACA 0012 airfoil at a 1.25 degree angle of attack, Mach 0.5. The results areobtained with fewer degrees of freedom when compared to Overflow and CFL3D. The higharithmetic intensity and the ease of parallelization makes MCGS an ideal choice for mul-tiple GPUs. The memory size of the left-hand side matrices in the implicit method limitsthe scheme’s use for high polynomial orders on a single GPU but it is shown that thebottleneck in memory usage can be mitigated by using multiple GPUs. The scheme is ableto maintain near perfect weak scaling showing that it can be effectively distributed overmultiple GPUs to solve large problems without a significant degradation in performance.

    I. Introduction

    Significant contributions have been made towards progressing high-order methods in computational fluiddynamics (CFD). High-order methods refer to a branch of numerical algorithms which employ higher thansecond order spatial discretization. These algorithms can improve the accuracy of a simulation at a reducedcomputational cost.1 In the past, these methods have failed to penetrate the computational design pro-cess because they are generally less robust and more complex to implement than commonly used low-ordermethods. In recent years, these disadvantages have been mitigated and high-order methods are becom-ing increasingly more popular in the study of steady and unsteady, vortex dominated flows over complexgeometries. These flows are often more difficult to simulate using low-order methods because of the highcomputational cost and increased sensitivity to geometry and numerical dissipation.

    Discontinuous finite element methods have been the focal point of recent efforts in developing a high-order compressible flow solver for unstructured grids. Popular examples include the Discontinuous Galerkin(DG) scheme2–4 and the Spectral Difference (SD) scheme.5,6 Huynh7 proposed a Flux Reconstruction (FR)approach for tensor-product elements that provides a generalized differential framework for recovering boththe collocation based nodal DG scheme as well as a version of the SD scheme. This framework has beensuccessfully extended to triangular8,9 and tetrahedral10 elements as well. Even more general frameworkssuch as the Correction Procedure via Reconstruction (CPR)11 have now been proposed that unify the FRand the Lifting Collocation Penalty (LCP)12 formulations. Recently, the direct Flux Reconstruction (DFR)method has been developed as a simplified formulation of the FR method that reduces the theoretical andimplementation complexity of the FR method.13

    The push towards high-order, unsteady flow simulations over complex geometries has sparked a needfor faster convergence for large scale problems. Accelerated explicit methods and the polynomial multigridmethod have been used to accelerate convergence rates but are sometimes not enough to overcome thestiffness found in aerodynamic applications where the cell volume varies by several orders of magnitudebetween the body and the far field.14–16 For these class of problems, implicit methods offer an alternatemeans to converge steady state solutions or drive the solution to physical time steps in dual time stepping

    ∗Ph.D. Candidate, Department of Aeronautics and Astronautics, Stanford University, AIAA Student Member†Professor, Department of Aeronautics and Astronautics, Stanford University, AIAA Member

    1 of 33

    American Institute of Aeronautics and Astronautics

    46th AIAA Fluid Dynamics Conference 13-17 June 2016, Washington, D.C.

    AIAA 2016-3965

    Copyright © 2016 by Jerry Watkins, Joshua Romero, Antony Jameson. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.

    AIAA Aviation

    http://crossmark.crossref.org/dialog/?doi=10.2514%2F6.2016-3965&domain=pdf&date_stamp=2016-06-10

  • methods by means of larger pseudo time steps. In particular, lower-upper symmetric Gauss-Seidel (LU-SGS)has shown promising results in unstructured compressible flow solvers utilizing finite volume methods, SDmethods, the CPR method and more recently, the compact high-order method.16–20

    Graphical Processing Units (GPUs) are also becoming more popular among those in the scientific com-puting community and can demonstrate a substantial performance gain for programs using high-order meth-ods.14,15,19,21,22 The DFR method is well suited for GPUs because the vast majority of operations areelement local and the increase in amount of work per degree of freedom couples well with the high computa-tional potential of GPUs. Castonguay et al. has shown the potential of these accelerators to produce resultsfor unsteady simulations using explicit time stepping.14 Typically, implicit time stepping has been a moredifficult problem to address because of the increase in memory requirements and the serial aspects of thealgorithm but there have been advances which show that there are methods of overcoming these problemson a single GPU.19

    In this paper, a multi-GPU high-order compressible flow solver for unstructured grids is developed andused to perform implicit time stepping on steady Euler problems. The paper is formatted as follows. SectionII gives a detailed overview of the DFR method on one-dimensional, two-dimensional quadrilateral andtriangular elements and the Euler equations. Section III gives an overview of the explicit RK44 method, theimplicit Multicolored Gauss-Seidel (MCGS) method and a description of the analytical implicit Jacobianused in the implicit method. Section IV discusses the implementation details of the explicit and implicitmethod on a multi-GPU system. Section V provides two numerical tests which verify that the MCGSmethod produces accurate results. Lastly, Section VI provides an overview of the speedup of the GPUimplementation over the CPU implementation and a strong and weak scalability analysis of the multi-GPUimplementation.

    II. Direct Flux Reconstruction Method

    In this section, a detailed overview of the DFR method is provided. In previous work, this schemehas been proven capable of recovering the FR form of the nodal DG method with a simplified procedurerelative to the existing FR framework.13 The section begins with a description of the method applied toa one-dimensional scalar conservation law. This is followed by a description of the scheme applied to atwo-dimensional scalar conservation law and Euler equations on quadrilateral and triangular elements.

    A. One-Dimensional Formulation

    1. Problem Specification

    Consider the one-dimensional scalar conservation law,

    ∂u

    ∂t+∂f(u)

    ∂x= 0, x ∈ Ω = [a, b], t > 0, (1)

    where x is the spatial coordinate, t is time, u = u(x, t) is a conserved scalar quantity and f = f(u) is theflux. An initial condition is specified and Dirichlet and Neumann boundary conditions are introduced on theleft and right boundaries so that,

    u(x, 0) = u0(x),

    u(a, t) = ua,

    ∂f

    ∂x(b, t) = 0. (2)

    Following a traditional nodal finite element method, the domain is partitioned into Neles non-overlappingelements,

    Ω =

    Neles⋃ele=1

    Ωele, (3)

    where Ωele = [xele, xele+1). With the domain partitioned, the exact solution u and the exact flux f(u) canbe approximated by the numerical solution and the numerical flux,

    uδ =

    Neles∑ele=1

    uδele, fδ =

    Neles∑ele=1

    fδele. (4)

    2 of 33

    American Institute of Aeronautics and Astronautics

  • A linear isoparametric mapping is introduced from the physical domain x ∈ Ωele to the parent domainξ ∈ ΩS = [−1, 1) such that

    ξ(x|Ωele) = 2(

    x− xelexele+1 − xele

    )− 1,

    x|Ωele(ξ) =(

    1− ξ2

    )xele +

    (1 + ξ

    2

    )xele+1. (5)

    Applying this transformation gives rise to a transformed equation within the standard element ΩS of thefollowing form

    ∂ûδele∂t

    +1

    |Jele|∂f̂δele∂ξ

    = 0, (6)

    where

    ûδele = uδele(x|Ωele(ξ), t),

    f̂δele = fδele(x|Ωele(ξ), t),

    and |Jele| = 12 (xele+1 − xele) is the determinant of the geometric element Jacobian matrix of the coordinatetransformation. In what follows, the hat notation to denote the transformed solution and correspondingtransformed flux will be dropped for brevity.

    2. Direct Flux Reconstruction

    Consider the transformed semi-discrete equation for the one-dimensional scalar conservation law,

    ∂uδele∂t

    = − 1|Jele|δfδeleδξ

    , (7)

    whereδfδeleδξ is the numerical derivative of f

    δele. In the DFR method, a transformed globally C

    0 continuous

    flux fδ is reconstructed in order compute the numerical derivative of fδele in each element. The first stepis to further discretize each element by Nspts1D = P + 1 distinct solution points so that the discontinuoussolution in each element uδele can be represented by a piecewise interpolating polynomial of degree P ,

    uδele(ξ) =

    Nspts1D∑spt=1

    uδspt,ele `spt(ξ), (8)

    where {`1(ξ), . . . , `Nspts1D(ξ)} are the Lagrange polynomials defined at the solution points {ξ1, . . . , ξNspts1D}.This can be written in vector format as,

    uδele(ξ) = `(ξ)Tuδele, (9)

    where `(ξ)T = [`1(ξ), . . . , `Nspts1D(ξ)] and uδele = [u

    δ1,ele, . . . , u

    δNspts1D,ele

    ]T . To recover the nodal DG method,the solution points are chosen to be collocated with the zeros of the Legendre polynomial of degree P + 1,also known as the Gauss-Legendre points.13

    The next step is to extrapolate the discontinuous solution to the element interfaces using Eq.(9). Theextrapolated values in each element are written as,

    uδele(−1) = `(−1)Tuδele,uδele(+1) = `(+1)

    Tuδele. (10)

    The transformed common interface fluxes are computed by using the extrapolated discontinuous solution onboth sides of each interface as the left and right states in an appropriate numerical flux formulation for theequation being solved. The transformed common interface fluxes are written as

    fδ,Wele = fI(uδele−1(+1), u

    δele(−1)),

    fδ,Eele = fI(uδele(+1), u

    δele+1(−1)), (11)

    3 of 33

    American Institute of Aeronautics and Astronautics

  • where f I(uL, uR) is the interface flux function and fδ,Wele and f

    δ,Eele are the transformed common interface

    fluxes on the west and east boundaries of the eleth element, respectively. A Riemann solver is commonlyused as the interface flux function. In this paper, the Rusanov flux is computed so that

    f I(uL, uR) =1

    2(f(uR) + f(uL))−

    1

    2|λ(uL, uR)|(uR − uL), (12)

    where

    |λ(uL, uR)| = max(∣∣∣∣∂f∂u (uR)

    ∣∣∣∣ , ∣∣∣∣∂f∂u (uL)∣∣∣∣) , (13)

    and ∂f∂u (u) is the wavespeed or the derivative of the flux with respect to the solution. The transformed com-mon interface fluxes at Dirichlet and Neumann boundaries are computed by using the boundary conditionsspecified in Eq.(2),

    Dirichlet BC: fδ,W1 = f(ua),

    Neumann BC: fδ,ENeles = f(uδNeles

    (1)), (14)

    The next step is to construct a transformed continuous flux fδele such that a piecewise sum results ina transformed globally C0 continuous flux, fδ, that passes through the transformed common interface fluxvalues at element interfaces. This is accomplished using the following Lagrange interpolant,

    fδele(ξ) = fδ,Wele

    ˜̀0(ξ) +

    Nspts1D∑spt=1

    fδspt,ele˜̀spt(ξ) + f

    δ,Eele

    ˜̀P+2(ξ), (15)

    where {˜̀0(ξ), . . . , ˜̀P+2(ξ)} are the Lagrange interpolating polynomials of degree P + 2 defined at P + 3collocation points {−1, ξ1, . . . , ξNspts1D , 1}, fδspt,ele = f(uδspt,ele) is the transformed flux evaluated at solutionpoints and fδele is the resulting transformed continuous flux.

    The final step is to obtain the numerical derivative of fδele by differentiating Eq.(15) with respect to ξand evaluating at each solution point,

    δfδeleδξ

    (ξi) = fδ,Wele

    ∂ ˜̀0∂ξ

    (ξi) +

    Nspts1D∑spt=1

    fδspt,ele∂ ˜̀spt∂ξ

    (ξi) + fδ,Eele

    ∂ ˜̀P+2∂ξ

    (ξi). (16)

    This can be manipulated into a matrix-vector format so that,

    δfδeleδξ

    = DWξ fδ,Wele +Dξ f

    δele +D

    Eξ f

    δ,Eele , (17)

    where fδele = [fδ1,ele, . . . , f

    δNspts1D,ele

    ]T and Dξ ∈ R(Nspts1D×Nspts1D) and DWξ , DEξ ∈ R(Nspts1D×1) are polyno-mial differentiation operators such that

    Dξp,m =∂ ˜̀m∂ξ

    (ξp), p,m = 1, 2, . . . , Nspts1D,

    DWξp =∂ ˜̀0∂ξ

    (ξp), p = 1, 2, . . . , Nspts1D,

    DEξp =∂ ˜̀P+2∂ξ

    (ξp), p = 1, 2, . . . , Nspts1D. (18)

    This is coupled with Eq.(7) to obtain the transformed semi-discrete equation in vector format,

    ∂uδele∂t

    = − 1|Jele|δfδeleδξ

    . (19)

    B. Two Dimensional Extension to Quadrilateral Elements

    The DFR method, along with other Flux Reconstruction (FR) methods, can be directly extended to quadri-lateral elements using a tensor-product formulation.7 While the cited references describe the methodologyin the context of the standard FR method, the same procedure can be applied to the DFR method bysimply replacing the FR correction procedure using correction polynomials with the Lagrange interpolationdescribed by Eq.(15). A summary of the procedure is described below.

    4 of 33

    American Institute of Aeronautics and Astronautics

  • 1. Problem Specification

    Consider the two-dimensional scalar conservation law,

    ∂u

    ∂t+∂f(u)

    ∂x+∂g(u)

    ∂y= 0, (x, y) ∈ Ω, t > 0, (20)

    where Ω is an arbitrary domain, x and y are the spatial coordinates, t is time, u = u(x, y, t) is a conservedscalar quantity and f = f(u) and g = g(u) are the fluxes in the x and y directions, respectively. Aninitial condition is specified and Dirichlet and Neumann boundary conditions are introduced on arbitraryboundaries ∂ΩΘ and ∂ΩΦ, respectively, such that the entire boundary is ∂Ω = ∂ΩΘ

    ⋃∂ΩΦ and,

    u(x, y, 0) = u0(x, y),

    u(x, y, t) = uΘ(x, y), (x, y) ∈ ∂ΩΘ,∂fn

    ∂n(x, y, t) = 0, (x, y) ∈ ∂ΩΦ, (21)

    where n is the direction normal to ∂ΩΦ and fn is the component of the flux along n.

    Following a traditional nodal finite element method, the domain is partitioned into Neles non-overlapping,conforming quadrilateral elements,

    Ω =

    Neles⋃ele=1

    Ωele. (22)

    Each quadrilateral element in the physical domain (x, y) is mapped to a reference element in the transformedparent space (ξ, η) so that, (

    x

    y

    )= Γele(ξ, η) =

    Nnpts∑npt=1

    Mnpts(ξ, η)

    (xnpt,eleynpt,ele

    )(23)

    where Mnpts(ξ, η) are the element shape functions and Nnpts is the number of points used to define thephysical space element.

    x1,ele

    x2,elebb

    b

    b

    b b

    bb

    x4,ele

    x3,ele

    (−1,−1)

    (1, 1)(−1, 1)

    (1,−1)

    x = Γele(ξ, η)

    y

    x ξ

    η

    Figure 1: Mapping of physical space quadrilateral element to reference quadrilateral using mapping Γele(ξ, η)

    Applying this transformation gives rise to a transformed equation of the following form,

    ∂ûδele∂t

    +1

    |Jele|

    (∂f̂δele∂ξ

    +∂ĝδele∂η

    )= 0, (24)

    5 of 33

    American Institute of Aeronautics and Astronautics

  • where

    ûδele = uδele(Γele(ξ, η), t),

    f̂δele =∂y

    ∂ηfδele(Γele(ξ, η), t)−

    ∂x

    ∂ηgδele(Γele(ξ, η), t),

    ĝδele = −∂y

    ∂ξfδele(Γele(ξ, η), t) +

    ∂x

    ∂ξgδele(Γele(ξ, η), t),

    and the terms, |Jele|, ∂y∂η , ∂x∂η ,∂y∂ξ and

    ∂x∂ξ are computed in each element from Eq.(23). In what follows, the

    hat notation to denote the transformed solution and corresponding transformed fluxes will be dropped forbrevity.

    2. Direct Flux Reconstruction

    Consider the transformed semi-discrete equation for the two-dimensional scalar conservation law,

    ∂uδele∂t

    = − 1|Jele|

    (δfδeleδξ

    +δgδeleδη

    ), (25)

    whereδfδeleδξ and

    δgδeleδη are the numerical derivatives of f

    δele and g

    δele, respectively. The DFR method for 2D

    quadrilateral elements is similar to the method for 1D. The first step is to further discretize each quadrilateralelement by Nspts = (P+1)

    2 distinct solution points generated through a tensor product of a set of 1D solutionpoints. Each solution point is defined by the sets {ξ1, . . . , ξNspts1D} and {η1, . . . , ηNspts1D}. The discontinuoussolution in each element uδele can be represented by a product of piecewise interpolating polynomials of degreeP ,

    uδele(ξ, η) =

    Nspts∑spt=1

    uδspt,eleφspt(ξ, η), (26)

    where φspt(ξ, η) = `i(ξ)`j(η) and `i(ξ) and `j(η) are 1D Lagrange polynomials defined at the sptth solution

    point located at (ξi, ηj). This can be written in vector format as,

    uδele(ξ, η) = φ(ξ, η)Tuδele, (27)

    where φ(ξ, η)T = [`1(ξ)`1(η), . . . , `Nspts1D(ξ)`Nspts1D(η)] and uδele = [u

    δ1,ele, . . . , u

    δNspts,ele

    ]T .The next step is to extrapolate the discontinuous solution to Nspts1D = P + 1 distinct flux points on each

    edge of a quadrilateral element for a total of Nfpts = 4Nspts1D flux points. Using Eq.(27), the extrapolatedvalues in each element are written as,

    uδ,Wele = EW uδele, u

    δ,Eele = E

    E uδele,

    uδ,Sele = ES uδele, u

    δ,Nele = E

    N uδele, (28)

    where uδ,Wele ,uδ,Eele ,u

    δ,Sele ,u

    δ,Nele ∈ R(Nspts1D×1) are the extrapolated discontinuous solution vectors on the west,

    east, south and north boundaries of the eleth element, respectively, as shown in Figure 2 andEW ,EE ,ES ,EN ∈R(Nspts1D×Nspts) are polynomial extrapolation operators such that

    EWp,m = φm(−1, ηp), p = 1, 2, . . . , Nspts1D, m = 1, 2, . . . , Nspts,EEp,m = φm(+1, ηp), p = 1, 2, . . . , Nspts1D, m = 1, 2, . . . , Nspts,

    ESp,m = φm(ξp,−1), p = 1, 2, . . . , Nspts1D, m = 1, 2, . . . , Nspts,ENp,m = φm(ξp,+1), p = 1, 2, . . . , Nspts1D, m = 1, 2, . . . , Nspts, (29)

    Transformed common interface fluxes that are normal to the element faces are computed by using theextrapolated discontinuous solution on both sides of each interface as the left and right states in a commoninterface function. The transformed common interface fluxes are written as

    fδ,Wele = dAWele f

    I(uδ,EeleN,uδ,Wele ), f

    δ,Eele = dA

    Eele f

    I(uδ,Eele ,uδ,WeleN ),

    gδ,Sele = dASele f

    I(uδ,NeleN,uδ,Sele ), g

    δ,Nele = dA

    Nele f

    I(uδ,Nele ,uδ,SeleN), (30)

    6 of 33

    American Institute of Aeronautics and Astronautics

  • r

    r

    r

    r b

    L

    R

    R R

    R

    L

    L

    L

    S

    N

    EW

    br rb

    r

    r

    b

    Figure 2: A visual representation of a quadrilateral element in parent space for a polynomial order of P = 1.The solution points are marked by blue circles, the flux points are marked by red squares and west, east,south and north faces are represented by W,E, S,N , respectively. Left and right states in an interface fluxare represented by L and R.

    where ”eleN” refers to the neighboring element and dAWele ,dAEele,dA

    Sele,dA

    Nele ∈ R(Nspts1D×Nspts1D) are diag-

    onal matrices that transform the common interface fluxes such that

    dAWele,p,p =∣∣JWele,p∣∣ ∣∣(JWele,p)−T n̂W ∣∣ , p = 1, 2, . . . , Nspts1D,

    dAEele,p,p =∣∣JEele,p∣∣ ∣∣(JEele,p)−T n̂E∣∣ , p = 1, 2, . . . , Nspts1D,

    dASele,p,p =∣∣JSele,p∣∣ ∣∣(JSele,p)−T n̂S∣∣ , p = 1, 2, . . . , Nspts1D,

    dANele,p,p =∣∣JNele,p∣∣ ∣∣(JNele,p)−T n̂N ∣∣ , p = 1, 2, . . . , Nspts1D, (31)

    where JWele,p, JEele,p, J

    Sele,p, J

    Nele,p are the geometric element Jacobian matrices evaluated at the p

    th flux point

    and n̂W , n̂E , n̂S , n̂N are the unit normals in parent space. The Rusanov flux used for the common interfacefunction from Eq.(12) now becomes

    f I(uL, uR) =1

    2(fn(uR) + f

    n(uL))−1

    2|λ(uL, uR)|(uR − uL), (32)

    where fn(u) is the flux normal to the face and

    |λ(uL, uR)| = max(∣∣∣∣∂fn∂u (uR)

    ∣∣∣∣ , ∣∣∣∣∂fn∂u (uL)∣∣∣∣) , (33)

    where ∂fn

    ∂u (u) is the wavespeed of the normal flux or the derivative of the normal flux with respect to thesolution. The transformed common interface fluxes at Dirichlet and Neumann boundaries defined in Eq.(21)are computed as,

    Dirichlet BC: fδ,Θ = dAΘ fn(uΘ),

    Neumann BC: fδ,Φ = dAΦ fn(uδ,Φ), (34)

    where uΘ is a vector of uΘ(x, y) evaluated at the boundary flux points, uδ,Φ is a vector of extrapolated

    solutions on the boundary flux points and dAΘ,dAΦ are diagonal matrices that transform the normal fluxesat the boundaries.

    7 of 33

    American Institute of Aeronautics and Astronautics

  • The transformed continuous fluxes in each element, fδele and gδele, are constructed such that they pass

    through the transformed common interface fluxes at flux points by using Lagrange interpolants. The nu-merical derivative of the transformed continuous fluxes evaluated at each solution point can then be writtenas

    δfδeleδξ

    (ξi, ηj) =

    Nspts1D∑fpt=1

    fδ,Wele,fpt∂ ˜̀0∂ξ

    (ξi) ˜̀fpt(ηj) +

    Nspts1D∑m=1

    Nspts1D∑p=1

    fδele(ξp, ηm)∂ ˜̀p∂ξ

    (ξi) ˜̀m(ηj)

    +

    Nspts1D∑fpt=1

    fδ,Eele,fpt∂ ˜̀P+2∂ξ

    (ξi) ˜̀fpt(ηj),

    δgδeleδη

    (ξi, ηj) =

    Nspts1D∑fpt=1

    gδ,Sele,fpt˜̀fpt(ξi)

    ∂ ˜̀0∂η

    (ηj) +

    Nspts1D∑m=1

    Nspts1D∑p=1

    gδele(ξp, ηm)˜̀p(ξi)

    ∂ ˜̀m∂η

    (ηj)

    +

    Nspts1D∑fpt=1

    gδ,Nele,fpt˜̀fpt(ξi)

    ∂ ˜̀P+2∂η

    (ηj), (35)

    where {˜̀0(ξ), . . . , ˜̀P+2(ξ)} and {˜̀0(η), . . . , ˜̀P+2(η)} are the 1D Lagrange interpolating polynomials of degreeP+2 defined at P+3 collocation points {−1, ξ1, . . . , ξNspts1D , 1} and {−1, η1, . . . , ηNspts1D , 1}, respectively, andfδele(ξp, ηm) =

    ∂y∂ηf(u

    δele(ξp, ηm))− ∂x∂η g(uδele(ξp, ηm)) and gδele(ξp, ηm) = −

    ∂y∂ξ f(u

    δele(ξp, ηm))+

    ∂x∂ξ g(u

    δele(ξp, ηm))

    are the transformed fluxes evaluated at solution points. This can be manipulated into a matrix-vector formatso that,

    δfδeleδξ

    = DWξ fδ,Wele +Dξ f

    δele +D

    Eξ f

    δ,Eele ,

    δgδeleδη

    = DSη gδ,Sele +Dη g

    δele +D

    Nη g

    δ,Nele , (36)

    where fδele = [fδ1,ele, . . . , f

    δNspts,ele

    ]T , gδele = [gδ1,ele, . . . , g

    δNspts,ele

    ]T and Dξ, Dη ∈ R(Nspts×Nspts) and DWξ , DEξ ,DSη , D

    Nη ∈ R(Nspts×Nspts1D) are polynomial differentiation operators. This is coupled with Eq.(25) to obtain

    the transformed semi-discrete equation in vector format,

    ∂uδele∂t

    = − 1|Jele|

    (δfδeleδξ

    +δgδeleδη

    ). (37)

    3. Extension of Tensor Product Formulation to Triangular Elements

    The tensor product formulation of the DFR method on quadrilaterals can be directly extended to triangularelements using an edge-collapsing method.23 In the cited reference, the treatment of ghost flux points,defined as the flux points co-located at the collapsed vertex is discussed. For first-order fluxes, the commoninterface flux at these points is set to zero since there is no face area at these points. A visual depiction ofthese elements can be seen in Figure 3.

    C. Euler Equations

    1. Problem Specification

    Consider the unsteady, two-dimensional, Euler equations in conservative form,

    ∂U

    ∂t+∂F

    ∂x+∂G

    ∂y= 0, (38)

    U =

    ρ

    ρu

    ρv

    e

    , F =

    ρu

    ρu2 + p

    ρuv

    (e+ p)u

    , G =

    ρv

    ρuv

    ρv2 + p

    (e+ p)v

    , (39)

    8 of 33

    American Institute of Aeronautics and Astronautics

  • x1,elex2,ele

    x3,ele

    x = Γele(ξ, η)

    y

    x ξ

    η

    b b

    bbr

    r r

    r

    r

    rr

    rs

    rs

    x4,ele

    rbbrr s

    r

    b

    r

    b

    r

    Figure 3: Mapping of physical space triangular element to reference quadrilateral using mapping Γ(ξ, η).Hollow red squares depict interface ghost flux points.

    where ρ is density, u, v are the velocity components in the x, y directions, respectively, and e is total energyper unit volume. The pressure is determined from the equation of state,

    p = (γ − 1)(e− 1

    2ρ(u2 + v2

    )), (40)

    where γ is the ratio of specific heats.

    2. Direct Flux Reconstruction

    The DFR method can be directly applied to Eq.(37) so the transformed semi-discrete equation becomes,

    ∂Uδele∂t

    = − 1|Jele|

    (δF δeleδξ

    +δGδeleδη

    ). (41)

    The formulation of the discontinuous solution and the extrapolation procedure of the solution to flux pointsfollows exactly as described in the previous section. The transformed common interface fluxes are alsocomputed the same as before. For the Euler equations, the Rusanov flux used for the common interfacefunction now becomes,

    F I(UL, UR) =1

    2(Fn(UR) + F

    n(UL))−1

    2|λ(UL, UR)|(UR − UL), (42)

    where Fn(U) is the flux normal to the face and

    |λ(UL, UR)| = max (|V nR |+ cR, |V nL |+ cL) , (43)where V n is the velocity normal to the face and c is the speed of sound. The boundary conditions used atthe boundary faces are shown in the appendix.

    Following from Eq.(36), the numerical derivatives of the transformed continuous fluxes evaluated ateach solution point for each variable can be written in a matrix-vector format. Consider arranging thenumerical solution in each element into a vector of (Nspts × 1) values for each conservative variable so thatUele,var = [Uele,var,1, Uele,var,2, . . . , Uele,var,Nspts ]

    T where ”var” represents an index for a solution variable. Thenumerical derivative can then be written as,

    δF δele,varδξ

    = DWξ Fδ,Wele,var +D

    Dξ F

    δele,var +D

    Eξ F

    δ,Eele,var,

    δGδele,varδη

    = DSη Gδ,Sele,var +D

    Dη G

    δele,var +D

    Nη G

    δ,Nele,var, (44)

    9 of 33

    American Institute of Aeronautics and Astronautics

  • This can be applied directly to Eq.(41) so that the transformed semi-discrete equation in vector formatbecomes,

    ∂Uδele,var∂t

    = − 1|Jele|

    (δF δele,varδξ

    +δGδele,var

    δη

    ). (45)

    For the remainder of the paper, the delta notation to denote the numerical approximation of solution andflux will be dropped for brevity.

    III. Time-Stepping Schemes

    The fully discrete equation in each element is obtained by substituting the exact time derivative term inEq.(41) with the numerical time derivative,

    δUeleδt

    = R(Uele, UeleN), (46)

    where,

    R(Uele, UeleN) = −1

    |Jele|

    (δFeleδξ

    +δGeleδη

    ), (47)

    and UeleN is the set of all neighboring solution point values needed for the residual R.

    A. Explicit Method

    An explicit, four-stage Runge-Kutta (RK) scheme is used to update the solution in all elements at eachstage,

    Res(1) = R(Us),

    Res(2) = R(Us + ∆t Res(1)),

    Res(3) = R(Us +1

    2∆t Res(2)),

    Res(4) = R(Us +1

    2∆t Res(3)),

    Us+1 = Us +1

    6Res(1) +

    1

    3Res(2) +

    1

    3Res(3) +

    1

    6R(Us + ∆t Res(4)) (48)

    where ∆t is the numerical time step and the subscript ”ele” has been omitted to signify that the residualand update computations happen on all elements.

    A timestep based on the Courant-Friedrich-Lewy (CFL) condition can be computed as,

    ∆tele =CFL Vele∮∂Ωele|λ| dA

    , (49)

    where Vele is the element volume and ∂Ωele refers to the element boundary.

    B. Implicit Method

    An implicit, backward Euler scheme is used to find the solution in each element at the next time step,

    ∆Uele = ∆t R(Us+1ele , U

    s+1eleN) (50)

    where ∆Uele = Us+1ele − Usele. A Taylor series expansion of R(Us+1ele , Us+1eleN) is used to linearize the equation,

    R(Us+1ele , Us+1eleN) ≈ R(Usele, UseleN) +

    ∂Rsele∂Uele

    ∆Uele +∑eleN

    ∂Rsele∂UeleN

    ∆UeleN, (51)

    where Rsele = R(Usele, U

    seleN). Rearranging Eq.(50) by using the approximation in Eq.(51) gives the global

    linear system, (I

    ∆t+∂Rsele∂Uele

    )∆Uele −

    ∑eleN

    ∂Rsele∂UeleN

    ∆UeleN = R(Usele, U

    seleN). (52)

    10 of 33

    American Institute of Aeronautics and Astronautics

  • In order to parallelize the linear solver and eliminate the dependency of neighboring elements on the left-handside matrix, a multicolored Gauss-Seidel (MCGS) algorithm is used,(

    I

    ∆t+∂Rsele∂Uele

    )∆Uk+1ele = R(U

    sele, U

    seleN) +

    ∑eleN

    ∂Rsele∂UeleN

    ∆U∗eleN (53)

    where ∆Uk+1ele refers to the ∆U of an element with the current color and ∆U∗eleN refers to the most recently

    updated ∆U of neighboring elements. The solution Usele is updated to Us+1ele after all colors have been

    updated. For example, in a two color, red-black Gauss-Seidel the algorithm is: update ∆U on red elements,update ∆U on black elements, update Usele on all elements.

    The right-hand side can be further reduced by using the following linear approximation,

    R(Usele, U∗eleN) ≈ R(Usele, UseleN) +

    ∑eleN

    ∂Rsele∂UeleN

    ∆U∗eleN, (54)

    so that, (I

    ∆t+∂Rsele∂Uele

    )∆Uk+1ele = R(U

    sele, U

    ∗eleN). (55)

    The solution Usele must now be updated as soon as a color has been updated in order to compute a newresidual. It’s also possible to perform a backsweep of all colors. In this case, the equation becomes(

    I

    ∆t+∂Rsele∂Uele

    )∆Uk+1ele = R(U

    ∗ele, U

    ∗eleN). (56)

    where ∆Uk+1ele = Uk+1ele − U∗ele.

    C. Computation of the Jacobian Matrix

    From Eq.(47), the implicit Jacobian matrices, ∂Rele∂Uele , of size (NsptsNvars × NsptsNvars) can be computedanalytically,

    ∂Rele∂Uele

    = − 1|Jele|

    δξ

    (∂Fele∂Uele

    )+

    δ

    δη

    (∂Gele∂Uele

    )), (57)

    where δδξ

    (∂Fele∂Uele

    )and δδη

    (∂Gele∂Uele

    )are both of size (NsptsNvars×NsptsNvars). The numerical derivatives follow

    directly from a modification of Eq.(44),[δ

    δξ

    (∂Fele∂Uele

    )]i,j

    = DWξ

    [∂FWele∂UWele

    ]i,j

    [∂UWele∂Uele

    ]i,j

    +Dξ

    [∂Fele∂Uele

    ]i,j

    +DEξ

    [∂FEele∂UEele

    ]i,j

    [∂UEele∂Uele

    ]i,j

    ,[δ

    δη

    (∂Gele∂Uele

    )]i,j

    = DSη

    [∂GSele∂USele

    ]i,j

    [∂USele∂Uele

    ]i,j

    +Dη

    [∂Gele∂Uele

    ]i,j

    +DNη

    [∂GNele∂UNele

    ]i,j

    [∂UNele∂Uele

    ]i,j

    , (58)

    where i, j refers to a single component of an (Nvars×Nvars) derivative matrix. The transformed flux deriva-tives are diagonal matrices of size,[

    ∂Fele∂Uele

    ]i,j

    ,

    [∂Gele∂Uele

    ]i,j

    ∈ R(Nspts×Nspts),[∂FWele∂UWele

    ]i,j

    ,

    [∂FEele∂UEele

    ]i,j

    ,

    [∂GSele∂USele

    ]i,j

    ,

    [∂GNele∂UNele

    ]i,j

    ∈ R(Nspts1D×Nspts1D). (59)

    The derivatives of the solution at flux points with respect to the solution at solution points follows directlyfrom Eq.(28) so that Eq.(58) becomes,[

    δ

    δξ

    (∂Fele∂Uele

    )]i,j

    = DWξ

    [∂FWele∂UWele

    ]i,j

    EW +Dξ

    [∂Fele∂Uele

    ]i,j

    +DEξ

    [∂FEele∂UEele

    ]i,j

    EE ,[δ

    δη

    (∂Gele∂Uele

    )]i,j

    = DSη

    [∂GSele∂USele

    ]i,j

    ES +Dη

    [∂Gele∂Uele

    ]i,j

    +DNη

    [∂GNele∂UNele

    ]i,j

    EN . (60)

    11 of 33

    American Institute of Aeronautics and Astronautics

  • ∂Fele∂Uele

    = ∂y∂η∂F∂U (Uele) − ∂x∂η ∂G∂U (Uele) and ∂Gele∂Uele = −

    ∂y∂ξ

    ∂F∂U (Uele) +

    ∂x∂ξ

    ∂G∂U (Uele) are the derivatives of the trans-

    formed fluxes with respect to the solution at solution points. The derivative of the fluxes, ∂F∂U (U) and∂G∂U (U),

    are well known for the two-dimensional Euler equations and are shown in the appendix.∂FWele∂UWele

    ,∂FEele∂UEele

    ,∂GSele∂USele

    ,

    ∂GNele∂UNele

    are derivatives of the transformed common interface fluxes with respect to the solution at flux points.

    These are found by differentiating Eq.(30),[∂FWele∂UWele

    ]i,j

    = dAWele

    [∂F I

    ∂UR

    (UEeleN, U

    Wele

    )]i,j

    ,

    [∂FEele∂UEele

    ]i,j

    = dAEele

    [∂F I

    ∂UL

    (UEele, U

    WeleN

    )]i,j

    ,[∂GSele∂USele

    ]i,j

    = dASele

    [∂F I

    ∂UR

    (UNeleN, U

    Sele

    )]i,j

    ,

    [∂GNele∂UNele

    ]i,j

    = dANele

    [∂F I

    ∂UL

    (UNele, U

    SeleN

    )]i,j

    , (61)

    where each i, j component of the derivatives of the interface flux function are diagonal matrices of size(Nspts1D × Nspts1D). The derivative of the interface flux function or, in this case, the Rusanov flux iscomputed with respect to the left state and right state solution so that,

    ∂F I

    ∂UL(UL, UR) =

    1

    2

    ∂Fn

    ∂U(UL)−

    1

    2

    ∂UL(|λ(UL, UR)|(UR − UL)) ,

    ∂F I

    ∂UR(UL, UR) =

    1

    2

    ∂Fn

    ∂U(UR)−

    1

    2

    ∂UR(|λ(UL, UR)|(UR − UL)) , (62)

    where ∂Fn

    ∂U (U) is the derivative of the flux normal to the face. The derivative of the term with the wavespeedcan be split into a piecewise function so that

    ∂UL(|λ| (UR − UL)) = (UR − UL)

    ∂|λ|∂UL

    T

    − |λ| I,

    ∂UR(|λ| (UR − UL)) = (UR − UL)

    ∂|λ|∂UR

    T

    + |λ| I, (63)

    where I represents the identity matrix and

    ∂|λ|∂U

    =

    ∂∂U (|V n|+ c) if (|V n|+ c) > 0,0 otherwise.The derivative of the wavespeed is computed as

    ∂U(|V n|+ c) =

    −sgn(V n)V nρ − c2ρ +

    γ(γ−1)(u2+v2)4ρc

    sgn(V n)nxρ −γ(γ−1)u

    2ρc

    sgn(V n)nyρ −

    γ(γ−1)v2ρc

    γ(γ−1)2ρc

    , (64)

    where sgn(x) is the signum function and nx, ny are the x and y components of the unit normal vector n.

    It’s important to note that the derivative of the wavespeed is not defined when ∂|λ|∂UL =∂|λ|∂UR

    .The transformed common interface flux derivatives at boundaries are computed by taking the derivative

    of the boundary condition with respect to the solution at flux points and transforming. This operation isshown in the appendix. Additionally for triangular elements, the common interface flux derivatives are setto zero at ghost flux points since there is no flux.

    IV. Implementation

    The proposed implicit scheme has been implemented within ZEFR, an existing in-house solver utilizingthe DFR method and explicit time integration to solve the Euler and Navier-Stokes equations. The codecurrently supports simulations in 2D using quadrilateral and triangular elements and in 3D using hexahedral

    12 of 33

    American Institute of Aeronautics and Astronautics

  • elements. The CPU implementation is written in C++, supporting shared memory parallel execution usingOpenMP and distributed parallel operation using MPI. The GPU implementation is programmed in mixedCUDA C and C++, with support for distributed multi-GPU operation using MPI. An overview of theexisting implementation for the Euler equations using explicit timestepping and the required modificationsfor the implicit methodology is given in the following section.

    A. Explicit Implementation

    For an explicit computation, the central component of the implementation is the procedure to compute theresidual, R(U), to be used in the multistage Runge-Kutta solution update, seen in Eq.(48). Previous authorshave written extensively on GPU implementations of FR schemes using explicit timestepping, so much ofthe discussion in this section is review.14,15

    1. Data Structures and Layout

    To start, a description of the data structures used in the software implementation will be described. Whilethe mathematical description provided in the previous section contains operations from an element-local per-spective, for the greatest computational efficiency, the algorithm should be expressed using global operationswherever possible. This necessitates the definition of global solution and flux arrays that collect all of theelement-local vector data into single data structures. Using the DFR method for the Euler equations, thereare five of these structures: Uspts, Ufpts, Fspts, F

    Ifpts and (∇ ·F )spts where U denotes the solution, F denotes

    the flux, F I denotes the common interface flux, and (∇·F ) denotes the divergence of the flux. The subscript“spts” denotes data at the solution points and “fpts” denotes data at the flux points. The dimensions ofthese data arrays are as follows

    Uspts : [Nspts, Neles, Nvars]

    Ufpts : [Nfpts, Neles, Nvars]

    Fspts : [Nspts, Neles, Nvars, Ndims]

    F Ispts : [Nfpts, Neles, Nvars]

    (∇ · F )spts : [Nspts, Neles, Nvars]

    Unless otherwise stated, all data structures are arranged in a column-major format. As a representativeexample, Figure 4 depicts the data layout of Uspts. A key point to note is that the data is organized ina structure of arrays (SoA) format, where data associated with each variable is contiguous in memory.This layout proves beneficial on GPUs (and on CPUs using vector units) since it allows for operations oncoallesed data. Connecting this back to the element-local perspective, each column in these data structuresis associated with a single element in the domain. For the data at flux points, the data vectors associatedwith the faces of a given element are concatenated into a single column in the global data structures. Forexample, a column in Ufpts is set as,

    Ufpts(:, ele, var) =

    USele, varUEele, varUNele, varUWele, var

    , (65)where the colon operator indicates the entire range of data in that dimension.

    With the global solution and flux point data structures specified, corresponding global operator matricesfor solution extrapolation and polynomial differentiation can be defined. In the previous sections, severalmatrix operators were defined to perform these operations on a per-element basis using matrix-vector prod-ucts. The operators EN , ES , EE , EW , defined in Eq. (31), are used to extrapolate solution point datato flux points on a per-face basis. These operators can be combined into a single operator E of dimension

    13 of 33

    American Institute of Aeronautics and Astronautics

  • Uspts = Nspts

    Neles

    V ar0 V ar1 V ar3V ar2

    Figure 4: Data layout of Uspts array

    (Nfpts ×Nspts) by vertically concatenating the existing operators

    E =

    ES

    EE

    EN

    EW

    (66)Similarly, the flux point polynomial differentiation operators,DNη ,D

    Sη ,D

    Eξ ,D

    Wξ , defined in Eqs.(35) and

    (37), can be combined into a single operator ∇fpts of dimension (Nspts×Nfpts) by horizontally concatenatingthe existing operators

    ∇fpts =[DSη D

    Eξ D

    Nη D

    ](67)

    The ∇fpts operator is used to compute the contribution of the transformed common interface flux to thedivergence of the flux. The solution point polynomial differentiation operators, Dξ, Dη, also defined inEqs.(35) and (37) can be maintained as defined.

    2. Computation of the Residual

    With the global data structures and related operators defined, the procedure to compute the residual, R(U),can be completed in the following steps:

    1. Extrapolate the solution at solution points, Uspts, to solution at flux points, Ufpts via global matrix-matrix multiplication

    Ufpts(:, :, var) = EUspts(:, :, var)

    2. Compute transformed common numerical fluxes at flux points using Eq.(42) and left/right state vari-ables via flux point pairwise operations.

    3. Compute transformed Euler flux at solution points using Eq.(39) via solution pointwise operations.

    4. Compute divergence of flux at solution points using global matrix-matrix multiplication

    (∇ · F )spts(:, :, var) = DξFspts(:, :, var, 0) +DηFspts(:, :, var, 1) + ∇fptsF Ifpts(:, :, var)

    where dimension index 0 corresponds to the ξ direction and index 1 corresponds to the η direction. Notethat during the RK stage update, the divergence of the flux is divided by the determinant of the Jacobianat each solution point to form the complete residual.

    With this framework in place, the software implementation on CPU and GPU can be implementedusing only a few major tasks. To perform steps involving global matrix-matrix multiplications, one canutilize one of the many high-performance BLAS libraries available on CPU and GPU. For this study, theOpenBLAS library on CPUs and CUBLAS library on GPUs are utilized.24,25 For the remaining steps,custom functions/kernels must be developed. For extensive discussion on how to develop high-performancekernels for these tasks, see papers by Castonguay et al. and Witherden et al.14,15

    14 of 33

    American Institute of Aeronautics and Astronautics

  • 3. Multi-GPU Extension Using MPI

    To enable the distribution of the algorithm onto multiple GPUs, communication routines using MPI areutilized. First, the computational domain is partitioned using METIS with each partition assigned to asingle GPU. Computations are carried out independently within each partition, with coupling occurringonly at the flux points shared between partitions, where solution state information from the neighboringpartition must be communicated to compute transformed common numerical fluxes.

    The modified procedure to compute the residual over multiple GPUs is:

    1. Extrapolate the solution at solution points, Uspts, to solution at flux points, Ufpts via global matrix-matrix multiplication

    Ufpts(:, :, var) = EUspts(:, :, var)

    2. On each partition, pack buffer of partition boundary flux point solution data on GPU, copy data fromGPU to host CPU, and commence non-blocking MPI send/receive of data between partitions.

    3. During non-blocking transfer:

    (a) Compute transformed common numerical fluxes at partition internal flux points using Eq.(42)and left/right state variables via flux point pairwise operations.

    (b) Compute transformed Euler flux at solution points using Eq.(39) via solution pointwise operations.

    (c) Compute solution point contribution to divergence of flux at solution points using global matrix-matrix multiplication

    (∇ · F )spts(:, :, var) = DξFspts(:, :, var, 0) +DηFspts(:, :, var, 1)

    4. Once MPI communication is complete, copy data to GPU from the host CPU and unpack the buffer.

    5. Compute transformed common numerical fluxes at partition boundary flux points using Eq.(42) andleft/right state variables via flux point pairwise operations.

    6. Add flux point contribution to divergence of flux at solution points using global matrix-matrix multi-plication

    (∇ · F )spts(:, :, var) += ∇fptsF Ifpts(:, :, var)

    The use of non-blocking MPI communication routines allows useful work to be completed while datais transferred between partitions, masking the impact of host to host latency. Additionally, the bufferpack/unpack operations and data transfer between the host and device are placed into a separate stream onthe GPU, with the data transfers completed using asynchronous memcopy routines in CUDA. This allowsthese operations to be performed concurrently with the main residual computation, further masking theimpact of the communication.

    B. Implicit Implementation

    In the following section, the implementation details for the proposed MCGS implicit scheme are provided.

    1. Data Structures and Layout

    For the implicit implementation, several new data structures are introduced in a similar layout to the existingdata structures used in the explicit solver. The new data structures are DFDUspts, DFDU

    Ifpts, ∆Uspts, RHS

    and LHS, where DFDU denotes the transformed flux derivatives with respect to the solution, DFDU I

    denotes the transformed common interface flux derivatives with respect to the solution, ∆U denotes thesolution update, RHS denotes the right-hand side of the implicit update equation, and LHS denotes thedata structure containing the final element-local left-hand side matrices. The dimensions of these data arrays

    15 of 33

    American Institute of Aeronautics and Astronautics

  • are

    DFDUspts : [Nspts, Nvars, Nvars, Neles, Ndims]

    DFDU Ifpts : [Nfpts, Nvars, Nvars, Neles]

    ∆Uspts : [Nspts, Nvars, Neles]

    RHS : [Nspts, Nvars, Neles]

    LHS : [Nspts, Nvars, Nspts, Nvars, Neles]

    The flux derivative data structures are similar to the existing flux data structures; however, for these datastructures, there are a total of (Nvars ×Nvars ×Ndims) values per solution/flux point. The data is organizedin the same SoA format as the flux data, with the individual Jacobian terms treated as separate variables.Explicitly, the values of DFDUspts and DFDU

    Ifpts are defined as

    DFDUspts(:, i, j, ele, 0) =

    [∂Fele∂Uele

    ]i,j

    ~1

    DFDUspts(:, i, j, ele, 1) =

    [∂Gele∂Uele

    ]i,j

    ~1

    DFDU Ifpts(:, i, j, ele)) =

    [∂FSele∂USele

    ]i,j

    ~1[∂FEele∂UEele

    ]i,j

    ~1[∂FNele∂UNele

    ]i,j

    ~1[∂FWele∂UWele

    ]i,j

    ~1

    which extract and store the diagonal terms of the transformed flux derivative matrices as columns in thedata arrays. The memory layout of the LHS data structure can be seen in Figure 5. From the figure, notethat the element-local left-hand side matrices are horizontally concatenated to create the global structure.Furthermore, each element matrix is split into (Nspts × Nspts) subblocks, one block for each (i, j) variablepair.

    LHS =

    Nspts ×Nvars

    Nspts

    Ele0 Ele1

    V ar0 V ar0V ar1 V ar1

    V ar0

    V ar1

    Figure 5: Sample data layout of LHS array with two variables and two colors

    16 of 33

    American Institute of Aeronautics and Astronautics

  • 2. MCGS Iteration

    The procedure to complete one MCGS iteration is as follows:

    1. Compute the residual over the whole domain.

    2. Compute the Jacobian matrices and form element-local left-hand side matrices. Store in LHS.

    3. Perform LU factorization on LHS matrices (CPU) OR compute inverses of LHS matrices (GPU).

    4. In loop over colors:

    (a) Compute the residual for the current color only, store in RHS.

    (b) Compute ∆Uspts for current color via triangular solves of LU factored LHS matrices (CPU) orbatched matrix-vector multiplication by LHS inverses (GPU). See Eq. (55) for system.

    (c) Add ∆Uspts to Uspts of current color.

    3. Constructing the left-hand side matrices

    The construction of the element-local left-hand side matrices is carried out in several steps:

    1. Compute transformed flux derivatives, DFDUspts at the solution points by applying the analyticexpressions given in Eq.(??) to Uspts.

    2. Compute transformed common interface flux derivatives, DFDU Ifpts, at flux points using Eq.(62) andleft/right state variables via flux point pairwise operations.

    3. Construct element local LHS entries:

    (a) Combine DFDUspts and DFDUIfpts into Jacobian subblocks via Eq.(60). Store in corresponding

    LHS location.

    (b) Form complete LHS matrices in place via Eqs.(55) and (57)

    The first point to note about this procedure is that the computation of the transformed flux derivativeterms at the solution and flux points follows the exact same structure as the existing flux computations.It then follows that one can utilize a nearly identical kernel structure to compute these values, swappingin new expressions as appropriate. This is exactly what is done to compute these terms in the currentimplementation.

    A second point is in regards to parallel operation on multiple CPUs/GPUs within a partitioned domain.As with the common interface flux in the residual computation, the computation of the transformed commonflux derivative terms uses left and right state information which requires coupling at the flux points onpartition boundaries through MPI communication. However, since the Jacobian is constructed following acall to compute the residual over the entire domain, the left and right state variables will have already beentransferred, allowing the flux derivative computation to continue without any additional MPI communication.

    This leaves only the final step, the computation of the Jacobian sublocks and formation of the completedLHS for discussion. Consider Eq.(60) which describes the contributions to the subblock Jacobian matricesby spatial dimension, repeated here for convenience[

    δ

    δξ

    (∂Fele∂Uele

    )]i,j

    = DWξ

    [∂FWele∂UWele

    ]i,j

    EW +Dξ

    [∂Fele∂Uele

    ]i,j

    +DEξ

    [∂FEele∂UEele

    ]i,j

    EE ,[δ

    δη

    (∂Gele∂Uele

    )]i,j

    = DSη

    [∂GSele∂USele

    ]i,j

    ES +Dη

    [∂Gele∂Uele

    ]i,j

    +DNη

    [∂GNele∂UNele

    ]i,j

    EN ,

    These equations can be expressed using global operators and data structures as[δ

    δξ

    (∂Fele∂Uele

    )+

    δ

    δη

    (∂Gele∂Uele

    )]i,j

    = Dξ diag[DFDUspts(:, ele, i, j, 0)]

    +Dη diag[DFDUspts(:, ele, i, j, 1)]

    + ∇fpts diag[DFDU Ifpts(:, ele, i, j)]E (68)

    17 of 33

    American Institute of Aeronautics and Astronautics

  • where

    diag[DFDUspts(:, ele, i, j, 0)] =

    [∂Fele∂Uele

    ]i,j

    diag[DFDUspts(:, ele, i, j, 1)] =

    [∂Gele∂Uele

    ]i,j

    diag[DFDU Ifpts(:, ele, i, j)] =

    [∂FSele∂USele

    ]i,j

    0 0 0

    0[∂FEele∂UEele

    ]i,j

    0 0

    0 0[∂FNele∂UNele

    ]i,j

    0

    0 0 0[∂FWele∂UWele

    ]i,j

    Eq.(68) reveals that each Jacobian subblock is comprised of three terms. The first two terms are simplythe polynomial differentiation operators Dξ and Dη with columns scaled by the transformed flux derivativeterms at the solution points. The third term is the flux point divergence operator ∇fpts, scaled by thetransformed flux derivative terms at the flux points, right multiplied by the extrapolation operator E. Withthis operation broken down into basic tasks, implementation into a simple GPU kernel can be completed. Inthe current implementation, this operation is completed by assigning a warp of 32 threads to each column ofsubblocks. Each warp iterates through the subblocks in the column, computing and filling in the respectiveJacobian entries in LHS. A visual depiction of the thread assignment for this kernel can be seen in Figure6. The final formation of the LHS matrices can be completed by applying Eqs.(55) and (57).

    LHS =

    Nspts ×Nvars

    Nspts

    V ar0 V ar1

    V ar0

    V ar1

    Threads

    Figure 6: Data layout of LHS array with thread assignment

    4. Solving the element-local linear systems

    For both the CPU and GPU implementations of the code, the linear system solve occurs in two parts. Forthe CPU implementation, after the LHS matrices have been constructed for all colors, they are immediatelyLU factored, and the factored matrices are stored. During the update loop over colors, the element-local

    18 of 33

    American Institute of Aeronautics and Astronautics

  • systems are solved using lower and upper triangular direct solves. These operations are implemented usingexisting functions from the TNT/JAMA libraries.26

    For the GPU implementation, a modified procedure is used to enhance performance. At first, the GPUcode was implemented using the same procedure as the CPU code, utilizing the batched LU factorizationand batched LU solve functionality from CUBLAS for simplicity. The batched functionality was chosen dueto the relatively small size of the LHS matrices. However, it was found that the batched solver performedpoorly, taking up a sizable portion of the time spent in the color update loop. If the LHS matrices are frozenover several iterations, which is commonly done for steady-state computations, an option to mitigate thiscost is to compute the inverse of the LHS matrices. This replaces the batched triangular solves with batchedelement-wise matrix vector multiplications which are much higher performing on the GPU. Hoffmann et.al utilized a similar procedure in their study.19 In the current implementation, the LHS matrix inversesare computed using CUBLAS. First, the LU factors of the matrices are computed using a batched LUfactorization. This is followed by a batched out-of-place computation of the inverses. Unfortunately, theout-of-place nature of the inverse computation requires an additional copy of the LHS storage. To limitthe amount of additional storage, the inversion is performed in multiple blocks. With this procedure, onlya subset of the LHS matrices is stored along with the full storage required for the inverses.

    5. Mesh Coloring and Modifications to Residual Computation

As required by the MCGS algorithm, a procedure to color meshes was implemented. There were several requirements for the coloring algorithm. The first of these requirements was to maintain a balanced distribution of colors to ensure that the computations for each color require equivalent amounts of work. The second requirement was to use fewer colors when possible in order to maintain larger subproblem sizes, leading to more efficient GPU performance.

To accomplish this, a modified greedy mesh coloring algorithm was implemented. To color the mesh, a target number of two colors is set and a vector of counts for each color is initialized to zero. Then, the element connectivity graph is traversed in a breadth-first order. For each element encountered, the color of neighboring elements is queried and the element is set to a color unused by its neighbors with the lowest count. The count for the used color is increased by one and the next element is processed. If an element is encountered where all available colors are already used by its neighbors, the procedure has failed to use the target number of colors. The target number of colors is increased by one, the element colors are reset, and the process is repeated until a feasible number of colors is found.
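A minimal host-side sketch of this greedy procedure is given below, assuming a connected element connectivity graph stored as adjacency lists; the function name and data structures are illustrative, not the solver's actual code.

#include <queue>
#include <vector>

// Sketch of the greedy coloring described above: starting from a target of two
// colors, elements are visited breadth-first and assigned the least-used color
// not taken by a neighbor; if no color is available, the target is increased by
// one and the coloring restarts.
std::vector<int> color_mesh(const std::vector<std::vector<int>>& neighbors)
{
    const int nEles = (int)neighbors.size();
    for (int nColors = 2; ; ++nColors) {
        std::vector<int> color(nEles, -1);
        std::vector<int> count(nColors, 0);
        std::vector<bool> visited(nEles, false);
        std::queue<int> toVisit;
        toVisit.push(0);
        visited[0] = true;
        bool feasible = true;

        while (!toVisit.empty()) {
            const int ele = toVisit.front();
            toVisit.pop();

            // Mark the colors already used by neighbors of this element.
            std::vector<bool> used(nColors, false);
            for (int nb : neighbors[ele])
                if (color[nb] >= 0) used[color[nb]] = true;

            // Pick the unused color with the lowest count.
            int best = -1;
            for (int c = 0; c < nColors; ++c)
                if (!used[c] && (best < 0 || count[c] < count[best])) best = c;

            if (best < 0) { feasible = false; break; }  // restart with one more color
            color[ele] = best;
            ++count[best];

            for (int nb : neighbors[ele])
                if (!visited[nb]) { visited[nb] = true; toVisit.push(nb); }
        }
        if (feasible) return color;
    }
}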

Some coloring results from this procedure can be observed in Figure 7. Note that on the structured quadrilateral meshes, the algorithm correctly applies the minimum two colors required. For the unstructured mixed NACA0012 mesh, the algorithm applies four colors, the number guaranteed to suffice for any planar graph by the four-color theorem; however, for a general unstructured mesh, the greedy algorithm may apply more than four colors.27 Additionally, the algorithm distributes colors very evenly, as desired. For multi-CPU/GPU cases, the mesh coloring is completed in serial on a single process, with the resulting coloring distributed between partitions.

With the mesh coloring in place, the existing global residual computation must be modified to allow for limited computation on elements of a specific color. To accomplish this, the data structures used for the residual computation are reorganized to group elements of common color together. This is depicted in Figure 8. Now, for solution point operations (steps 1, 3 and 4 in Section IV.A.2), the residual computation on elements of a particular color only requires specification of the element range corresponding to that color.
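A sketch of how this reorganization simplifies the per-color work is shown below. The kernel body (a local time step scaling) is only a stand-in for the actual solution point operations, and the colorStart/colorCount arrays are assumed to hold each color's element offset and count.

#include <cuda_runtime.h>

// Sketch: with elements grouped contiguously by color, a solution point kernel
// for a single color only needs the element range of that color.
__global__ void scale_by_local_dt(double* divF, const double* dt,
                                  int Nspts, int eleStart, int nElesColor)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= Nspts * nElesColor) return;

    const int ele = eleStart + tid / Nspts;   // global element index within the color's range
    const int spt = tid % Nspts;              // solution point within the element
    divF[spt + Nspts * ele] *= dt[ele];
}

// Host-side launch over the elements of a single color.
void update_color(double* divF, const double* dt, int Nspts,
                  const int* colorStart, const int* colorCount, int color)
{
    const int nElesColor = colorCount[color];
    const int threads = 128;
    const int blocks  = (Nspts * nElesColor + threads - 1) / threads;
    scale_by_local_dt<<<blocks, threads>>>(divF, dt, Nspts, colorStart[color], nElesColor);
}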

For the computation of the transformed common interface flux, only flux point pairs involving the target color should be updated (and communicated via MPI in the multi-CPU/GPU case). Despite this, the current implementation updates the common interface flux at all flux points, regardless of the color being updated. For two colors, this does not degrade performance much since all flux points are involved in the residual computation. For more than two colors, this can be inefficient since some common interface fluxes are computed unnecessarily. A more effective strategy to limit the computation to only the required flux points can improve performance and will be implemented in the future.


Figure 7: Mesh coloring examples. (a) Channel mesh, 2 colors, (384, 384); (b) NACA0012 mesh, 2 colors, (512, 512); (c) mixed NACA0012 mesh, 4 colors, (379, 378, 378, 377).

Figure 8: Data layout of Uspts array (Nspts × Neles per variable), with elements grouped by color

    V. Numerical Results

In this section, inviscid flow over a bump and inviscid flow over the NACA 0012 airfoil are simulated in order to verify the implementation of the implicit, high-order DFR method on unstructured meshes for GPUs and present results on efficiency. For inviscid flow over a bump, the rate of convergence of entropy error is verified for a polynomial order of P = 2. In the case of the NACA 0012 airfoil, a grid convergence study on the lift coefficient compares well with results from Overflow and CFL3D. Iteration counts and wall-clock times for convergence are found for all cases and show a decrease in efficiency as the meshes are refined. For a given mesh, the iteration count remains low for higher polynomial orders, leading to more computationally efficient results. Additionally, it is shown that a mixed mesh can be used to obtain accurate results without any significant changes to the algorithm.


All meshes are constructed using second order boundaries. All simulations are started from uniform flow using the maximum CFL possible at startup for each case. The CFL is increased at an exponential rate every time the left-hand side is updated until a maximum specified CFL is reached, so that

\mathrm{CFL} = \min\left(r_{start}\, r^{j},\ r_{max}\right)\, \mathrm{CFL}_{adv}(P), \qquad (69)

where r_start is the starting CFL ratio, r was set to 2.0, r_max was set to 10,000, j is the jth left-hand side update and CFL_adv(P) is the maximum CFL value for DG, RK44, linear advection as a function of polynomial order.28 It was possible to set a larger maximum CFL but there was no noticeable change in convergence. In order to improve efficiency, the left-hand side was updated every 100 iterations. There was very little change to convergence when it was updated more frequently. Local time-stepping is also used unless otherwise specified.
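As a small illustration, the ramp of Eq.(69) can be evaluated as follows; this is a sketch, and the function name and the source of CFL_adv(P) are assumptions.

#include <algorithm>
#include <cmath>

// Sketch of the ramp in Eq.(69): the ratio r_start * r^j is capped at r_max and
// scaled by the explicit advection limit CFL_adv(P), tabulated elsewhere.
double ramped_cfl(double r_start, double r, double r_max, int j, double cfl_adv_P)
{
    return std::min(r_start * std::pow(r, j), r_max) * cfl_adv_P;
}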

The vector ℓ1 norm of the residual for the continuity equation is computed every 100 iterations and is used to track the convergence of all simulations. A converged solution is assumed if the residual drops by 10 orders of magnitude from the initial residual. All cases are performed on a single NVIDIA Tesla C2070. The CPU and GPU versions of the code produced the same results. A two-color MCGS implicit method without a backsweep is used for all cases except for the mixed mesh cases, which used a four-color method without backsweep. Using a backsweep did not prove beneficial for most test cases.

    A. Inviscid flow over a bump

The first test case involves the solution of subsonic flow over a smooth Gaussian bump in a channel. The inflow Mach number is set to 0.5 with zero angle of attack. The L2 functional norm of the entropy error is used to determine the accuracy of the solution and is given by

\|e_S\|_{L^2(\Omega)} = \sqrt{\frac{\int_{\Omega}\left(\frac{p}{p_{\infty}}\left(\frac{\rho_{\infty}}{\rho}\right)^{\gamma} - 1\right)^{2} dV}{\int_{\Omega} dV}}, \qquad (70)

where the integrals are approximated numerically using Gaussian quadrature with 10 quadrature points in each element. A full description of this problem can be found online through the international high-order workshops.29
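A sketch of this quadrature evaluation is given below, assuming the quadrature weights already include the volume (Jacobian) factors and that the pressure and density have been interpolated to the quadrature points; the names are illustrative.

#include <cmath>
#include <vector>

// Sketch of Eq.(70): accumulate the entropy error integrand over all quadrature
// points. wq are quadrature weights (including volume factors); p and rho are
// the pressure and density at those points; pInf, rhoInf, gam are the freestream
// pressure, freestream density and ratio of specific heats.
double entropy_error_L2(const std::vector<double>& p, const std::vector<double>& rho,
                        const std::vector<double>& wq,
                        double pInf, double rhoInf, double gam)
{
    double num = 0.0, den = 0.0;
    for (size_t q = 0; q < wq.size(); ++q) {
        const double e = (p[q] / pInf) * std::pow(rhoInf / rho[q], gam) - 1.0;
        num += wq[q] * e * e;   // integral of the squared entropy error
        den += wq[q];           // integral of dV (total volume)
    }
    return std::sqrt(num / den);
}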

The entropy error is computed for a series of meshes using the implicit method on a single GPU and a polynomial order of P = 2. The starting CFL ratio rstart, total iteration count, wall-clock time and entropy error for each case are shown in Table 1. The table shows that the total number of iterations needed for convergence increases as the mesh is refined.

Neles            (24 × 8)    (48 × 16)   (96 × 32)   (192 × 64)
rstart           30.0        25.0        6.0         2.0
Iterations       1400        2700        5200        10300
Wall Time (s)    0.722       2.01        10.43       75.35
Entropy Error    7.28e-05    1.04e-05    1.40e-06    1.79e-07

Table 1: Convergence results for different meshes, inviscid flow over a bump, implicit MCGS, single GPU, P = 2

The entropy error is then computed on the (48 × 16) quadrilateral mesh for a series of polynomial orders using the implicit method on a single GPU. The starting CFL ratio rstart, total iteration count, wall-clock time and entropy error for each case are shown in Table 2. The table shows that the iteration count remains relatively the same as the polynomial order increases.

Figure 9 shows the entropy error vs. length scale h = 1/√nDoF and wall-clock time in seconds. The results show a rate of convergence of 2.89 for the fixed polynomial order of P = 2, which is close to the theoretical result for a linear, steady-state case: P + 1. The results also show that increasing the polynomial order on the (48 × 16) mesh reduces the entropy error while maintaining a smaller wall-clock time compared to the more refined mesh.


P                2           3           4           5
rstart           25.0        35.0        4.0         8.0
Iterations       2700        4400        4700        4300
Wall Time (s)    2.01        4.92        13.47       20.02
Entropy Error    1.04e-05    1.30e-06    6.65e-07    3.63e-07

Table 2: Convergence results for different polynomial orders, inviscid flow over a bump, implicit MCGS, single GPU, (48 × 16) quadrilateral mesh

Figure 9: Entropy error, inviscid flow over a bump, implicit MCGS, P = 2. (a) Entropy error ‖eS‖L2(Ω) vs. h = 1/√nDoF, with an order-3 reference slope; (b) entropy error vs. wall-clock time (s). Curves: P = 2 mesh refinement and the (48 × 16) mesh at increasing polynomial order.

The convergence history of the test case using a (48 × 16) quadrilateral mesh is shown in Figure 10 for an explicit RK4 method and the implicit MCGS method. As expected, the implicit method converges at a much faster rate. The mesh and final pressure contours are shown in Figure 11.

Figure 10: Convergence history, (48 × 16) quadrilateral mesh, inviscid flow over a bump, P = 2. (a) Residual ‖Res(ρ)‖ℓ1 vs. iterations; (b) residual vs. wall-clock time (s), for RK4 and MCGS.


Figure 11: Mesh and pressure contours, (48 × 16) quadrilateral mesh, inviscid flow over a bump, implicit MCGS, P = 2. (Cp contours range from -1.19 to 0.212.)

    B. Inviscid flow over the NACA 0012 airfoil

The second test case involves the solution of subsonic flow over the NACA 0012 airfoil. The inflow Mach number is set to 0.5 with a 1.25 degree angle of attack. The lift coefficient is used to determine the accuracy of the simulation and is compared to results from Vassberg and Jameson.30 A complete description is also found in this reference.

The lift and drag coefficients are computed for a series of O-meshes using the implicit method on a single GPU and a polynomial order of P = 4. All meshes have a far field located 100 chord lengths away. The starting CFL ratio rstart, total iteration count, wall-clock time, lift coefficient and drag coefficient for each case are shown in Table 3. As in the previous test case, the table shows that the total number of iterations needed for convergence increases as the mesh is refined.

Neles              (8 × 8)      (16 × 16)    (32 × 32)    (64 × 64)    (128 × 128)
rstart             4.0          2.0          1.5          1.0          1.0
Iterations         800          1000         2000         4200         7600
Wall Time (s)      0.718        1.302        7.52         56.48        395.32
Lift Coefficient   1.8107e-01   1.8037e-01   1.7948e-01   1.7949e-01   1.7950e-01
Drag Coefficient   1.0108e-03   6.9629e-05   1.9481e-05   1.8437e-05   1.8355e-05

Table 3: Convergence results for different meshes, inviscid flow over the NACA 0012 airfoil, implicit MCGS, single GPU, P = 4

The lift and drag coefficients are then computed for a series of mixed meshes using the implicit method on a single GPU and a polynomial order of P = 4. Global time stepping is used in this case. All meshes have a far field located 100 chord lengths away. The starting CFL ratio rstart, total iteration count, wall-clock time, lift coefficient and drag coefficient for each case are shown in Table 4. The table shows that a mixed mesh can be used to converge the lift coefficient. It can also be observed that the mixed mesh cases maintain the ability to run large time steps, unconstrained by the explicit CFL limit. This is a notable result, as a strong CFL constraint was observed to limit the utility of the collapsed-edge triangular elements when coupled with explicit time stepping.23

Lastly, the lift and drag coefficients are computed on the (32 × 32) quadrilateral O-mesh for a series of polynomial orders using the implicit method on a single GPU. The starting CFL ratio rstart, total iteration count, wall-clock time, lift coefficient and drag coefficient for each case are shown in Table 5.


Neles              764          1512         6048
rstart             2.0          2.0          0.5
Iterations         3600         4300         10500
Wall Time (s)      12.28        24.81        220.14
Lift Coefficient   1.7953e-01   1.7952e-01   1.7949e-01
Drag Coefficient   3.2210e-05   3.2162e-05   2.3628e-05

Table 4: Convergence results for different mixed meshes, inviscid flow over the NACA 0012 airfoil, implicit MCGS, single GPU, P = 4

The table shows that the iteration count remains low for higher polynomial orders.

P                  2            3            4            5
rstart             5.0          2.0          1.5          1.0
Iterations         1500         3700         2000         3900
Wall Time (s)      1.357        5.23         7.50         24.54
Lift Coefficient   1.7853e-01   1.7963e-01   1.7948e-01   1.7950e-01
Drag Coefficient   1.4102e-04   3.1818e-05   1.9481e-05   1.8758e-05

Table 5: Convergence results for different polynomial orders, inviscid flow over the NACA 0012 airfoil, implicit MCGS, single GPU, (32 × 32) quadrilateral mesh

Figure 12 shows the lift coefficient vs. length scale h = 1/√nDoF and wall-clock time in seconds. Degrees of freedom for Overflow and CFL3D are assumed to be equal to the number of mesh elements. The figure shows that ZEFR is able to obtain a lift coefficient that is relatively close to the results from Overflow and CFL3D. It is also important to note that the lift coefficient is grid converged within three significant figures with fewer degrees of freedom. The test cases with mixed meshes obtained similar results. The coarsest mixed mesh was able to obtain a fairly accurate lift coefficient with fewer degrees of freedom by coarsening the far field regions. Unfortunately, the wall-clock time was still more than the wall-clock time for the (32 × 32) quadrilateral mesh. The wall-clock time may be larger because the four-color implicit method is not yet fully optimized.

Figure 12: Lift coefficient for inviscid flow over the NACA 0012 airfoil, implicit MCGS. (a) Lift coefficient vs. 1/√nDoF for ZEFR (Quad, P = 4), ZEFR (Mixed, P = 4), Overflow and CFL3D; (b) lift coefficient vs. wall-clock time (s) for ZEFR (Quad, P = 4) and ZEFR (Mixed, P = 4).

The convergence history of the test case using a (32 × 32) quadrilateral O-mesh is shown in Figure 13 for an explicit RK4 method and the implicit MCGS method. As expected, the implicit method converges at a much faster rate. The mesh and final pressure contours are shown in Figure 14.

Figure 13: Convergence history, (32 × 32) quadrilateral O-mesh, inviscid flow over the NACA 0012 airfoil, P = 4. (a) Residual ‖Res(ρ)‖ℓ1 vs. iterations; (b) residual vs. wall-clock time (s), for RK4 and MCGS.

Figure 14: Mesh and pressure contours, inviscid flow over the NACA 0012 airfoil, implicit MCGS, P = 4. (a) (32 × 32) O-mesh; (b) 764 element mixed mesh. (Cp contours range from -0.72 to 1.)

    VI. Performance Analysis

In this section, the computational performance of the GPU and multi-GPU implementation is characterized. Results comparing single GPU performance relative to a single CPU core for the implicit MCGS and explicit RK4 schemes are presented. This is followed by a strong and weak scalability study of the multi-GPU implementation for both schemes. For this section, all simulations are performed using NVIDIA Tesla C2070 GPUs and Intel Xeon X5650 CPUs. The multi-GPU cases are run on two nodes of a GPU cluster, with six GPUs and two CPUs installed on each node.

    A. Single GPU

In this section, the performance of the GPU implementation of the RK4 and MCGS schemes is compared with the serial implementation on a single CPU core. Inviscid flow over the NACA0012 airfoil is computed using different mesh sizes and polynomial orders with the same flow parameters as described in Section V. Since convergence is not important in this section, the total number of iterations is fixed to 1000 and the time step is fixed to ∆t = 1 × 10^−9. Tables 6 and 8 show the overall speedup of the GPU code compared to the CPU code for the RK4 and MCGS schemes, respectively. Tables 7 and 9 report the iterations per second achieved by the CPU code and GPU code for the RK4 and MCGS schemes, respectively.

Neles    (32 × 32)   (64 × 64)   (128 × 128)   (256 × 256)
P = 2    14.3        20.3        25.4          28.1
P = 3    21.7        24.7        30.8          32.7
P = 4    16.5        23.4        27.0          30.5
P = 5    17.5        25.1        29.1          30.8

Table 6: Speedup of a single GPU over a single CPU core for Explicit RK4, inviscid flow over the NACA0012 airfoil

Neles    (32 × 32)        (64 × 64)        (128 × 128)      (256 × 256)
P = 2    (57.4 / 819.7)   (13.6 / 276.2)   (3.23 / 82.0)    (0.763 / 21.4)
P = 3    (30.3 / 657.9)   (8.11 / 200.8)   (1.86 / 57.4)    (0.461 / 15.1)
P = 4    (24.1 / 396.8)   (5.73 / 134.2)   (1.37 / 37.2)    (0.311 / 9.49)
P = 5    (16.4 / 287.4)   (3.88 / 97.7)    (0.896 / 26.1)   (0.215 / 6.63)

Table 7: Iterations per second (CPU/GPU) for Explicit RK4, inviscid flow over the NACA0012 airfoil

Neles    (32 × 32)   (64 × 64)   (128 × 128)   (256 × 256)
P = 2    12.5        18.0        20.5          21.7
P = 3    18.6        22.3        24.2          25.4
P = 4    15.3        17.2        18.0          —
P = 5    17.9        20.3        21.4          —

Table 8: Speedup of a single GPU over a single CPU core for Implicit MCGS, inviscid flow over the NACA0012 airfoil

Neles    (32 × 32)         (64 × 64)        (128 × 128)      (256 × 256)
P = 2    (88.9 / 1111.1)   (21.3 / 381.7)   (5.10 / 104.5)   (1.23 / 26.7)
P = 3    (38.7 / 718.9)    (9.92 / 221.2)   (2.33 / 56.5)    (0.569 / 14.5)
P = 4    (18.1 / 278.6)    (4.52 / 77.9)    (1.11 / 20.0)    —
P = 5    (9.70 / 174.2)    (2.41 / 48.9)    (0.590 / 12.6)   —

Table 9: Iterations per second (CPU/GPU) for Implicit MCGS, inviscid flow over the NACA0012 airfoil

From the results, several observations can be made. First, the achieved speedup factor of the GPU code over the CPU code for both the explicit and implicit methods increases with mesh size. This is reasonable, as larger problem sizes can more effectively utilize GPU resources. Next, it can be observed that for lower polynomial orders, the implicit MCGS method completes more iterations per second than the explicit RK4 scheme. As the polynomial order is increased, however, this trend is reversed, with the explicit RK4 scheme achieving a greater iteration rate. This is unsurprising, as the size of the element-local linear systems for the implicit scheme grows very quickly with respect to P. This increases the cost of solving the systems at each iteration, leading to longer iteration times. The last trend observed in the performance results is a limitation in the maximum single GPU problem size for the implicit scheme. For the 256 × 256 element mesh, the P = 4 and P = 5 cases could not complete due to memory requirements exceeding the capacity of the GPU. For reference, the Tesla C2070 GPUs contain 6 GB of device memory. For the P = 4 case, each element-local system requires (25 × 4)² = 10,000 double-precision floating point values. For the 256 × 256 mesh, this results in a memory requirement of 5.24 GB just to store the system matrices. As noted in Section IV.B.4, the current GPU implementation requires some additional storage to compute the system inverses. For these cases, the additional storage was set to 25% of the storage required for the system matrices, leading to a memory requirement of 6.55 GB, which exceeds the capacity of the GPU. A similar computation for the P = 5 case reveals that the system matrices alone require 10.87 GB of memory. This provides clear motivation for future investigation into methods of reducing the memory required for the scheme. In the next section, it will be shown that this limitation can be overcome through distribution of the problem across multiple GPUs.
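For reference, the memory estimates above follow directly from the per-element system size, assuming 8 bytes per double-precision value:

\mathrm{mem}(P) = N_{eles}\,\left(N_{spts} N_{vars}\right)^{2} \times 8\ \mathrm{bytes},

P = 4: 65{,}536 \times (25 \times 4)^{2} \times 8\ \mathrm{B} \approx 5.24\ \mathrm{GB}, \quad 1.25 \times 5.24\ \mathrm{GB} \approx 6.55\ \mathrm{GB}\ \text{with the 25\% working copy},

P = 5: 65{,}536 \times (36 \times 4)^{2} \times 8\ \mathrm{B} \approx 10.87\ \mathrm{GB}.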

    B. Multi-GPU

In this section, the performance of the multi-GPU implementation of the RK4 and MCGS schemes is investigated. Since convergence is not important in this section, the total number of iterations in all cases is fixed to 5000 and the time step is fixed to ∆t = 1 × 10^−9.

    1. Strong Scalability

In order to analyze the strong scaling efficiency of the multi-GPU implementation, a sequence of quadrilateral O-meshes is used to compute inviscid flow over the NACA0012 airfoil, using P = 5 polynomials to represent the solution. The same flow parameters are used as described in Section V. Figure 15 shows the speedup of up to 12 GPUs relative to one GPU for both the explicit RK4 and implicit MCGS schemes.

From these results, it can be seen that both the explicit and implicit implementations exhibit quite good strong scalability, with improved scaling as the problem size increases. This is because with large problem sizes, the amount of computation in the partition interiors increases relative to the amount of communication required at the partition boundaries. This tends to reduce the contribution of any overhead introduced by the MPI communication procedures to the overall computation time. An additional observation is that the MCGS scheme achieves better scalability at a smaller problem size than the explicit RK4 scheme. This indicates that the amount of computation relative to the communication for the MCGS scheme is increasing more rapidly with problem size.

Figure 15: Speedup relative to one GPU, inviscid flow over the NACA0012, P = 5. (a) RK4, for the 32×32, 64×64, 128×128 and 256×256 meshes; (b) MCGS, for the 32×32, 64×64 and 128×128 meshes.


2. Weak Scalability

In order to analyze the weak scaling efficiency of the multi-GPU implementation, a sequence of quadrilateral O-meshes is used to compute inviscid flow over the NACA0012 airfoil, using P = 5 polynomials to represent the solution. The same flow parameters are used as described in Section V. The mesh sequence begins with the 128 × 128 O-mesh, with three additional meshes, each doubling the number of elements of the previous mesh in the sequence. Note that the starting mesh is the largest mesh run with the MCGS scheme on a single GPU for P = 5. For this study, the first mesh in the sequence is run on a single GPU, with subsequent meshes run on an increasing number of GPUs to keep the problem size per GPU constant. The wall-clock times and the achieved efficiency for both the RK4 and MCGS schemes are reported in Tables 10 and 11. For the implicit scheme, the memory required to store the linear systems, in GB, is also reported.

NGPUs            1             2             4             8
Neles            (128 × 128)   (256 × 128)   (256 × 256)   (512 × 256)
Wall Time (s)    191.6         205.28        213.04        212.11
Efficiency (%)   100           93.3          89.9          90.3

Table 10: Weak scalability results for the multi-GPU implementation, inviscid flow over the NACA0012, P = 5, Explicit RK4

NGPUs                                  1             2             4             8
Neles                                  (128 × 128)   (256 × 128)   (256 × 256)   (512 × 256)
Wall Time (s)                          396.81        398.19        401.35        403.35
Efficiency (%)                         100           99.6          98.9          98.4
Memory Req. for Linear Systems (GB)    2.71          5.44          10.87         21.74

Table 11: Weak scalability results for the multi-GPU implementation, inviscid flow over the NACA0012, P = 5, Implicit MCGS

Considering the results, the explicit RK4 implementation maintains a high level of performance, with the efficiency dropping to only around 90% for a problem distributed over 8 GPUs. The implicit MCGS implementation maintains even higher levels of performance, achieving greater than 98% efficiency across all cases tested. Considering the linear system sizes in each case, this performance is maintained for problems with linear system storage requirements ranging from 2.71 GB up to 21.74 GB. This result suggests that the implicit MCGS scheme can be effectively distributed over multiple GPUs to solve larger problems without a significant degradation in performance.

    VII. Conclusions

In this paper, a high-order compressible flow solver for unstructured grids is developed, implemented and analyzed. The solver utilizes the direct Flux Reconstruction (DFR) method and a multicolored Gauss-Seidel (MCGS) implicit method to converge the steady state Euler equations in a multi-GPU environment. The numerical results show that the correct rate of convergence for entropy error is obtained at a polynomial order of P = 2 for inviscid flow over a bump. A grid convergence study is performed on the NACA 0012 airfoil and a lift coefficient that compares well with Overflow and CFL3D is obtained with fewer degrees of freedom. For a given mesh, the iteration count needed for convergence remains low for higher polynomial orders but increases as the mesh is refined.

A performance analysis of the explicit RK4 and implicit MCGS methods is performed in order to assess the capabilities of the implicit scheme on a single GPU and on multiple GPUs. The results show that for lower polynomial orders, the implicit scheme achieves a greater iteration rate when compared to the explicit scheme. As the polynomial order is increased, the iteration rate falls below that of the explicit scheme because the amount of work per iteration increases rapidly with polynomial order. The memory size of the left-hand side matrices also increases rapidly with polynomial order, leading to test cases which could not be simulated on a single GPU.

It is shown that the bottleneck in memory usage can be mitigated by using multiple GPUs. Near perfect weak scaling is maintained for problem sizes which require up to 21.74 GB of data storage for the left-hand side matrices. The implicit method maintains higher levels of weak scaling performance compared to the explicit method due to the increased amount of work for the method. Both the explicit and implicit methods are able to achieve good strong scaling results, with improved performance on larger problem sizes. The implicit method is also able to achieve better scalability.

Despite the memory deficiencies of storing the left-hand side matrices for large polynomial orders, the results show promise for solving problems of engineering importance on small GPU clusters. An investigation into methods to reduce memory requirements of the implicit scheme is currently underway, and future milestones include a multi-GPU, implicit solver for the Navier-Stokes equations, the RANS equations and unsteady problems via dual time stepping.


Appendix

    A. Boundary Conditions

    For the Euler equations, the common interface fluxes at boundary faces are computed as,

F_B = F_n(U_B(U)), \qquad (71)

where F_n(U) is the flux normal to the face, U_B(U) is the solution prescribed at the boundary face and U = [\rho, \rho u, \rho v, e]^{T} is the solution extrapolated to the face.

    1. Solid Slip-Wall and Symmetry

On a solid surface where the flow is allowed to slip, the flow must remain tangent to the surface.31 The velocities on the boundary can be written as,

u_b = u - V^{n} n_x, \qquad v_b = v - V^{n} n_y,

where V^{n} = u n_x + v n_y and n_x, n_y are the x and y components of the unit normal vector n. An extrapolated pressure is also used to compute the total energy on the wall so that the solution on the boundary face is computed as,

U_B = \begin{bmatrix} \rho \\ \rho u_b \\ \rho v_b \\ \frac{p}{\gamma-1} + \frac{1}{2}\rho\left(u_b^{2} + v_b^{2}\right) \end{bmatrix}. \qquad (72)

This same boundary condition can be applied for symmetry boundary conditions.
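A minimal sketch of this boundary state, written as a CUDA host/device function, is given below; the function name and argument layout are assumptions for illustration.

#include <cuda_runtime.h>

// Sketch of the slip-wall/symmetry state of Eq.(72): remove the normal velocity
// component and recompute the total energy from the extrapolated pressure.
// U = [rho, rho*u, rho*v, e], (nx, ny) is the unit face normal, gam is gamma.
__host__ __device__ void slip_wall_state(const double U[4], double nx, double ny,
                                         double gam, double UB[4])
{
    const double rho = U[0];
    const double u   = U[1] / rho;
    const double v   = U[2] / rho;
    const double p   = (gam - 1.0) * (U[3] - 0.5 * rho * (u * u + v * v));

    const double Vn = u * nx + v * ny;   // velocity normal to the face
    const double ub = u - Vn * nx;       // tangential (slip) velocity components
    const double vb = v - Vn * ny;

    UB[0] = rho;
    UB[1] = rho * ub;
    UB[2] = rho * vb;
    UB[3] = p / (gam - 1.0) + 0.5 * rho * (ub * ub + vb * vb);
}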

    2. Characteristic Riemann Invariant Far Field

On a far field boundary, Riemann invariants for a one-dimensional flow normal to the boundary are used to determine the solution at the boundary face.32 First, the velocity normal to the face and the speed of sound are computed using the extrapolated solution and the free stream values,

V^{n} = u n_x + v n_y, \qquad V^{n}_{\infty} = u_{\infty} n_x + v_{\infty} n_y,

c = \sqrt{\frac{\gamma p}{\rho}}, \qquad c_{\infty} = \sqrt{\frac{\gamma p_{\infty}}{\rho_{\infty}}},

where n_x, n_y are the x and y components of the unit normal vector n and ∞ denotes freestream values that are set for a specific problem. The Riemann invariants can then be written as,

R = V^{n} + \frac{2c}{\gamma - 1}, \qquad R_{\infty} = V^{n}_{\infty} - \frac{2c_{\infty}}{\gamma - 1}.

The normal velocity and speed of sound at the boundary are written as,

V^{n}_b = \frac{1}{2}\left(R + R_{\infty}\right), \qquad c_b = \frac{\gamma - 1}{4}\left(R - R_{\infty}\right).

If V^{n} < 0, the flow is entering the domain and the velocity and entropy at the boundary are computed as,

u_b = u_{\infty} + (V^{n}_b - V^{n}_{\infty}) n_x, \qquad v_b = v_{\infty} + (V^{n}_b - V^{n}_{\infty}) n_y, \qquad s_b = \frac{p_{\infty}}{\rho_{\infty}^{\gamma}},

otherwise, the flow is exiting the domain and the velocity and entropy at the boundary are computed as,

u_b = u + (V^{n}_b - V^{n}) n_x, \qquad v_b = v + (V^{n}_b - V^{n}) n_y, \qquad s_b = \frac{p}{\rho^{\gamma}}.

The density and the pressure at the boundary can be computed from the entropy and speed of sound so that,

\rho_b = \left(\frac{1}{\gamma}\frac{c_b^{2}}{s_b}\right)^{\frac{1}{\gamma-1}}, \qquad p_b = \frac{1}{\gamma}\rho_b c_b^{2}.


The solution at the boundary can then be computed as,

U_B = \begin{bmatrix} \rho_b \\ \rho_b u_b \\ \rho_b v_b \\ \frac{p_b}{\gamma-1} + \frac{1}{2}\rho_b\left(u_b^{2} + v_b^{2}\right) \end{bmatrix}. \qquad (73)
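The full far field state can be assembled as in the following sketch, which simply follows the sequence of relations above; the function name and argument layout are assumptions for illustration.

#include <cmath>

// Sketch of the characteristic far field state of Eq.(73). U is the extrapolated
// state [rho, rho*u, rho*v, e]; freestream values and gamma are inputs.
void farfield_state(const double U[4], double nx, double ny, double gam,
                    double rhoInf, double uInf, double vInf, double pInf,
                    double UB[4])
{
    const double rho = U[0], u = U[1] / rho, v = U[2] / rho;
    const double p   = (gam - 1.0) * (U[3] - 0.5 * rho * (u * u + v * v));

    const double Vn    = u * nx + v * ny;
    const double VnInf = uInf * nx + vInf * ny;
    const double c     = std::sqrt(gam * p / rho);
    const double cInf  = std::sqrt(gam * pInf / rhoInf);

    const double R    = Vn + 2.0 * c / (gam - 1.0);        // outgoing invariant
    const double Rinf = VnInf - 2.0 * cInf / (gam - 1.0);  // incoming invariant

    const double Vnb = 0.5 * (R + Rinf);
    const double cb  = 0.25 * (gam - 1.0) * (R - Rinf);

    double ub, vb, sb;
    if (Vn < 0.0) {  // inflow: carry velocity and entropy in from the freestream
        ub = uInf + (Vnb - VnInf) * nx;
        vb = vInf + (Vnb - VnInf) * ny;
        sb = pInf / std::pow(rhoInf, gam);
    } else {         // outflow: carry velocity and entropy out from the interior
        ub = u + (Vnb - Vn) * nx;
        vb = v + (Vnb - Vn) * ny;
        sb = p / std::pow(rho, gam);
    }

    const double rhob = std::pow(cb * cb / (gam * sb), 1.0 / (gam - 1.0));
    const double pb   = rhob * cb * cb / gam;

    UB[0] = rhob;
    UB[1] = rhob * ub;
    UB[2] = rhob * vb;
    UB[3] = pb / (gam - 1.0) + 0.5 * rhob * (ub * ub + vb * vb);
}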

    B. Jacobian Matrices

The Jacobian matrices for the Euler equations can be written as

\frac{\partial F}{\partial U} =
\begin{bmatrix}
0 & 1 & 0 & 0 \\
\frac{1}{2}\left((\gamma-3)u^{2} + (\gamma-1)v^{2}\right) & (3-\gamma)u & (1-\gamma)v & \gamma-1 \\
-uv & v & u & 0 \\
-\frac{\gamma e u}{\rho} + (\gamma-1)u\left(u^{2}+v^{2}\right) & \frac{\gamma e}{\rho} + \frac{(1-\gamma)}{2}\left(3u^{2}+v^{2}\right) & (1-\gamma)uv & \gamma u
\end{bmatrix},

\frac{\partial G}{\partial U} =
\begin{bmatrix}
0 & 0 & 1 & 0 \\
-uv & v & u & 0 \\
\frac{1}{2}\left((\gamma-1)u^{2} + (\gamma-3)v^{2}\right) & (1-\gamma)u & (3-\gamma)v & \gamma-1 \\
-\frac{\gamma e v}{\rho} + (\gamma-1)v\left(u^{2}+v^{2}\right) & (1-\gamma)uv & \frac{\gamma e}{\rho} + \frac{(1-\gamma)}{2}\left(u^{2}+3v^{2}\right) & \gamma v
\end{bmatrix}. \qquad (74)

The Jacobian matrices for the boundary conditions can be computed by taking the derivative of the common interface flux at the boundary with respect to the extrapolated solution. Eq.(71) is differentiated to obtain,

\frac{\partial F_B}{\partial U} = \frac{\partial F}{\partial U_B}\frac{\partial U_B}{\partial U}\, n_x + \frac{\partial G}{\partial U_B}\frac{\partial U_B}{\partial U}\, n_y \qquad (75)

where n_x, n_y are the x and y components of the unit normal vector n, \frac{\partial F}{\partial U_B} = \frac{\partial F}{\partial U}(U_B), \frac{\partial G}{\partial U_B} = \frac{\partial G}{\partial U}(U_B) and \frac{\partial U_B}{\partial U} depends on the boundary condition being used. The derivative of the common interface flux can then be transformed using a similar operation to Eq.(61).
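A sketch of this chain rule assembly for a single flux point is given below, leaving out the transformation to reference space of Eq.(61); the 4 × 4 row-major storage and the function name are assumptions for illustration.

// Sketch of Eq.(75) at a single flux point: assemble dFB/dU from the interior
// flux Jacobians evaluated at UB and the boundary state Jacobian dUB/dU (all
// 4x4, row-major). The transformation to reference space (Eq.(61)) is omitted.
void boundary_flux_jacobian(const double dFdUB[4][4], const double dGdUB[4][4],
                            const double dUBdU[4][4], double nx, double ny,
                            double dFBdU[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            double sum = 0.0;
            for (int k = 0; k < 4; ++k)
                sum += (nx * dFdUB[i][k] + ny * dGdUB[i][k]) * dUBdU[k][j];
            dFBdU[i][j] = sum;
        }
}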

    1. Solid Slip-Wall and Symmetry

    The derivative of the solution at the boundary follows directly from Eq.(72),

\frac{\partial U_B}{\partial U} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1-n_x^{2} & -n_x n_y & 0 \\
0 & -n_x n_y & 1-n_y^{2} & 0 \\
\frac{1}{2}\left(u^{2}+v^{2}-u_b^{2}-v_b^{2}\right) & -u + (1-n_x^{2})u_b - n_x n_y v_b & -v - n_x n_y u_b + (1-n_y^{2})v_b & 1
\end{bmatrix}.

    2. Characteristic Riemann Invariant Far Field

The derivative of the solution at the boundary is computed by differentiating Eq.(73). The solution is rather complicated and is derived using Mathematica. Given the flow variables defined in Section A.2, the inflow and outflow derivatives can be defined separately. If V^{n} < 0, the flow is entering the domain and the following parameters are defined,

a_1 = \frac{\rho_b}{2c_b}, \qquad a_2 = \frac{\gamma}{\rho c},

b_1 = -\frac{V^{n}}{\rho} - \frac{a_2}{\rho}\left(\frac{p}{\gamma-1} - \frac{1}{2}\rho\left(u^{2}+v^{2}\right)\right),

b_2 = \frac{n_x}{\rho} - a_2 u, \qquad b_3 = \frac{n_y}{\rho} - a_2 v, \qquad b_4 = \frac{a_2}{c_b},

c_1 = \frac{c_b^{2}}{\gamma(\gamma-1)} + \frac{1}{2}\left(u_b^{2} + v_b^{2}\right), \qquad c_2 = u_b n_x + v_b n_y + \frac{c_b}{\gamma}.


The derivative of the solution at the boundary can then be computed as,

    ∂UB∂U

    =

    a1b1 a1b2 a1b3

    12ρbb4

    a1b1ub +12ρbb1nx a1b2ub +

    12ρbb2nx a1b3ub +