
Fluid Dynamics and GPU

Yue Wang

Thank you


the details about the airflow and contaminant transport inside the zone [6].

Recently, an FFD method [7] has been proposed for fast flow simulations in buildings as an intermediate method between the CFD and zonal/multizone models. The FFD method solves the continuity equation and unsteady Navier–Stokes equations as the CFD does. By using a different numerical scheme to solve the governing equations, the FFD can run about 50 times faster than the CFD with the same numerical setting on a single CPU [8]. Although the FFD is not as accurate as the CFD, it can provide more detailed information than a multizone model or a zonal model.

Although the FFD is much faster than the CFD, its speed is still not fast enough for the real-time flow simulation in a building. For example, our previous work [8] found that the FFD simulation can be real-time with 65,000 grids. If a simulation domain with 30 × 30 × 30 grids is applied for a room, the FFD code can only simulate the airflow in 2–3 rooms in real time. Hence, if we want to do real-time simulation for a large building, we have to further accelerate the FFD simulation.
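As a rough check of the figures quoted above, the sketch below divides the 65,000-grid real-time budget by the 30 × 30 × 30 grid used per room. The variable names are illustrative only and do not come from the paper.

```python
# Rough arithmetic check of the "2-3 rooms" estimate.
# All names here are illustrative; only the two numbers come from the text.
realtime_grid_budget = 65_000        # grids the FFD code handled in real time [8]
grids_per_room = 30 * 30 * 30        # 27,000 grids for one room

rooms_in_realtime = realtime_grid_budget / grids_per_room
print(f"{rooms_in_realtime:.2f} rooms")  # prints "2.41 rooms"
```

The result falls between two and three rooms, which is consistent with the estimate in the text.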

To reduce the computing time, many researchers have performed the flow simulations in parallel on multi-processor computers [9,10]. It is also possible to speed up the FFD simulation by running it in parallel on a multi-processor computer. However, this approach needs large investments in equipment purchase and installation and a designated space for installing the computers and the related capacity of the cooling system used in the space. In addition, the fees for the operation and maintenance of a multi-processor computer are also nearly the same as those of several single processor computers of the same capacity. Hence, multi-processor computers are a luxury for building designers or emergency management teams.

Recently, the GPU has attracted attention for parallel computing. Different from a CPU, the GPU is the core of a computer graphics card and integrates multiple processors on a single chip. Its structure is highly parallelized to achieve high performance for image processing. For example, an NVIDIA GeForce 8800 GTX GPU, available since 2006, integrates 128 processors so that its peak computing speed is 367 GFLOPS (Giga FLoating point Operations Per Second). Comparatively, the peak performance of an Intel Core 2 Duo 3.0 GHz CPU available at the same time is only about 32 GFLOPS [11]. Fig. 1 compares the computing speeds of CPU and GPU. The speed gap between the CPU and the GPU has been increasing since 2003. Furthermore, this trend is likely to continue in the future. Besides the GPU's high performance, the cost of a GPU is low. For example, a graphics card with an NVIDIA GeForce 8800 GTX GPU costs only around $500. It can easily be installed onto a personal computer and there are no other additional costs.

Thus, it seems possible to realize fast and informative indoor airflow simulations by using the FFD on a GPU. This paper reports our efforts to implement the FFD model in parallel on an NVIDIA GeForce 8800 GTX GPU. The GPU code was then validated by simulating several flows that consist of the basic features of indoor airflows.

2. Fast fluid dynamics

Our investigation used the FFD scheme proposed by Stam [7]. The FFD applies a splitting method to solve the continuity equation (1) and Navier–Stokes equation (2) for an unsteady incompressible flow:

$$\frac{\partial U_i}{\partial x_i} = 0, \tag{1}$$

$$\frac{\partial U_i}{\partial t} = -U_j \frac{\partial U_i}{\partial x_j} + \nu \frac{\partial^2 U_i}{\partial x_j^2} - \frac{1}{\rho}\frac{\partial P}{\partial x_i} + \frac{f_i}{\rho}, \tag{2}$$

where U_i and U_j are fluid velocity components in x_i and x_j directions, respectively; ν is kinematic viscosity; ρ is fluid density; P is pressure; t is time; and f_i are body forces, such as buoyancy force and

[Figure: computing speed in GFLOPS (0 to 400) versus year (2003 to 2007), with one curve for the GPU and one for the CPU.]

Fig. 1. Comparison of the computing speeds of GPU (NVIDIA) and CPU (INTEL) since 2003 [11].

Nomenclature

a_{i,j}, b_{i,j}: equation coefficients (dimensionless)
C: contaminant concentration (kg/m³)
f_i: body force (kg/m² s²)
H: the width of the room (m)
i, j: mesh node indices
k_C: contaminant diffusivity (m²/s)
k_T: thermal diffusivity (m²/s)
L: length scale (m)
P: pressure (kg/m s²)
S_C: contaminant source (kg/m³ s)
S_T: heat source (°C/s)
T: temperature (°C)
t: time (s)
u_{i,j}: velocity components at mesh node (i, j) (m/s)
U_b: bulk velocity (m/s)
U_i, U_j: velocity components in x_i and x_j directions, respectively (m/s)
U: horizontal velocity or velocity scale (m/s)
V: vertical velocity (m/s)
x_i, x_j: spatial coordinates
x, y: spatial coordinates
Δt: time step (s)
ν: kinematic viscosity (m²/s)
0: previous time step

W. Zuo, Q. Chen / Building and Environment 45 (2010) 747–757, p. 748

2 Traditional Computational Fluid Dynamics for Building Simulation

Computational fluid dynamics (CFD) is one of the branches of fluid mechanics that uses numerical methods and algorithms to solve and analyze problems that involve fluid flows. Computers are used to perform the millions of calculations required to simulate the interaction of liquids and gases with surfaces defined by boundary conditions.

The foundation of CFD is the Navier-Stokes equations. Named after Claude-Louis Navier and George Gabriel Stokes, they describe the motion of fluid substances, that is, substances that can flow. These equations arise from applying Newton's second law to fluid motion, together with the assumption that the fluid stress is the sum of a diffusing viscous term proportional to the gradient of velocity, plus a pressure term.

A fluid whose density and temperature are nearly constant is described by a velocity field u and a pressure field p. These quantities generally vary both in space and in time and depend on the boundaries surrounding the fluid. We will denote the spatial coordinate by x, which for two-dimensional fluids is x = (x, y) and for three-dimensional fluids is x = (x, y, z). Given that the velocity and the pressure are known for some initial time t = 0, the evolution of these quantities over time is given by the Navier-Stokes equations:

$$\nabla \cdot \mathbf{u} = 0, \qquad \frac{\partial \mathbf{u}}{\partial t} = -(\mathbf{u} \cdot \nabla)\mathbf{u} - \frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{f} \tag{1}$$

where ν is the kinematic viscosity of the fluid, ρ is its density and f is an external force [6].

The Reynolds-averaged Navier-Stokes (RANS) equations are time-averaged equations of motion for fluid flow. They are primarily used when dealing with turbulent flows. These equations can be used with approximations based on knowledge of the properties of flow turbulence to give approximate averaged solutions to the Navier-Stokes equations. For a stationary, incompressible flow of Newtonian fluid, these equations can be written as:

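$$\rho \bar{u}_j \frac{\partial \bar{u}_i}{\partial x_j} = \rho \bar{f}_i + \frac{\partial}{\partial x_j}\left[ -\bar{p}\,\delta_{ij} + \mu\left( \frac{\partial \bar{u}_i}{\partial x_j} + \frac{\partial \bar{u}_j}{\partial x_i} \right) - \rho\,\overline{u_i' u_j'} \right] \tag{2}$$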

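The K-epsilon model is one of the most common RANS turbulence models. It is a two-equation model, which means it includes two extra transport equations to represent the turbulent properties of the flow. This allows a two-equation model to account for history effects like convection and diffusion of turbulent energy. The model is widely used in building science research, especially for indoor air quality and thermal distribution simulation [7], [8], [9], [10], [11].

The model is relatively simple. For turbulent kinetic energy k:

$$\frac{\partial}{\partial t}(\rho k) + \frac{\partial}{\partial x_i}(\rho k u_i) = \frac{\partial}{\partial x_j}\left[ \left( \mu + \frac{\mu_t}{\sigma_k} \right) \frac{\partial k}{\partial x_j} \right] + P_k + P_b - \rho\epsilon - Y_M + S_k \tag{3}$$

For dissipation ε:

$$\frac{\partial}{\partial t}(\rho \epsilon) + \frac{\partial}{\partial x_i}(\rho \epsilon u_i) = \frac{\partial}{\partial x_j}\left[ \left( \mu + \frac{\mu_t}{\sigma_\epsilon} \right) \frac{\partial \epsilon}{\partial x_j} \right] + C_{1\epsilon}\frac{\epsilon}{k}\left( P_k + C_{3\epsilon} P_b \right) - C_{2\epsilon}\,\rho\,\frac{\epsilon^2}{k} + S_\epsilon \tag{4}$$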

The K-epsilon model should be a reference implementation for us, since it is the most widely used solver in building science. When we develop a new solver, we should compare its results to the K-epsilon model and see whether the new solver is accurate enough.

3 Fast Fluid Dynamics: A Tentative Way to Make CFD Faster

There are several ways to simplify the numerical solution of the Navier-Stokes equations to make CFD calculations faster. One of

solved to simulate the flow of a Newtonian fluid with collision models such as Bhatnagar-Gross-Krook (BGK). By simulating streaming and collision processes across a limited number of particles, the intrinsic particle interactions evince a microcosm of viscous flow behavior applicable across the greater mass [34], [35], [36].

The essential quantity in the Lattice Boltzmann Method (LBM) [11] is a density function (DF) f_i(x, t) on a discrete lattice, x = (jδx, kδy, lδz) (j ∈ [0, N_x], k ∈ [0, N_y], l ∈ [0, N_z]), with discrete velocity values e_i (i ∈ [0, N_v − 1]) at time t. Here, each e_i points from a lattice site to one of its N_v near-neighbor sites. N_x, N_y, and N_z are the numbers of lattice sites in the x, y, and z directions, respectively, with δx, δy and δz being the corresponding lattice spacings and N_v = 18 being the number of discrete velocity values. From the DF, we can calculate various physical quantities such as fluid density ρ(x, t) and velocity u(x, t):

$$\rho(\mathbf{x}, t) = \sum_i f_i(\mathbf{x}, t), \qquad \rho(\mathbf{x}, t)\,\mathbf{u}(\mathbf{x}, t) = \sum_i \mathbf{e}_i f_i(\mathbf{x}, t) \tag{13}$$

The time evolution of the DF is governed by the Boltzmann equation in the Bhatnagar-Gross-Krook (BGK) model. The LBM simulation thus consists of a time stepping iteration, in which collision and streaming operations are performed as time is incremented by δt at each iteration step.

The collision equation can be written as:

$$f_i(\mathbf{x}, t^+) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\rho(\mathbf{x}), \mathbf{u}(\mathbf{x})) \right) \tag{14}$$

where

$$f_i^{eq}(\rho, \mathbf{u}) = \rho\left( A + B(\mathbf{e}_i \cdot \mathbf{u}) + C(\mathbf{e}_i \cdot \mathbf{u})^2 + D u^2 \right) \tag{15}$$

Here A, B, C and D are constants, and the time constant τ is related to the kinematic viscosity ν through the relation ν = (τ − 1/2)/3.

The streaming equation can be written as:

Annu. Rev. Fluid Mech. 1998. 30:329–64
Copyright © 1998 by Annual Reviews Inc. All rights reserved

LATTICE BOLTZMANN METHOD FOR FLUID FLOWS

Shiyi Chen (1,2) and Gary D. Doolen (2)
(1) IBM Research Division, T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598; (2) Theoretical Division and Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545; e-mail: [email protected]

KEY WORDS: lattice Boltzmann method, mesoscopic approach, fluid flow simulation

ABSTRACT
We present an overview of the lattice Boltzmann method (LBM), a parallel and efficient algorithm for simulating single-phase and multiphase fluid flows and for incorporating additional physical complexities. The LBM is especially useful for modeling complicated boundary conditions and multiphase interfaces. Recent extensions of this method are described, including simulations of fluid turbulence, suspension flows, and reaction diffusion systems.

INTRODUCTION
In recent years, the lattice Boltzmann method (LBM) has developed into an alternative and promising numerical scheme for simulating fluid flows and modeling physics in fluids. The scheme is particularly successful in fluid flow applications involving interfacial dynamics and complex boundaries. Unlike conventional numerical schemes based on discretizations of macroscopic continuum equations, the lattice Boltzmann method is based on microscopic models and mesoscopic kinetic equations. The fundamental idea of the LBM is to construct simplified kinetic models that incorporate the essential physics of microscopic or mesoscopic processes so that the macroscopic averaged properties obey the desired macroscopic equations. The basic premise for using these simplified kinetic-type methods for macroscopic fluid flows is that the macroscopic dynamics of a fluid is the result of the collective behavior of many microscopic particles in the system and that the macroscopic dynamics is not sensitive to the underlying details in microscopic physics (Kadanoff 1986). By developing a simplified version of the kinetic equation, one avoids solving


$$f_i(\mathbf{x} + \mathbf{e}_i, t + \delta t) = f_i(\mathbf{x}, t^+) \tag{16}$$

It should be noted that the collision step involves a large number of floating-point operations that are strictly local to each lattice site, while the streaming step contains no floating-point operations but solely memory copies between nearest-neighbor lattice sites.
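The collision and streaming steps can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the implementation discussed in the slides: it uses a two-dimensional D2Q9 lattice with the standard D2Q9 weights instead of the 18-velocity three-dimensional lattice in the text, it uses periodic boundaries, and the function names (`macroscopic`, `equilibrium`, `collide`, `stream`, `viscosity`) are hypothetical. In this sketch the constants A, B, C and D of the equilibrium function become w_i, 3w_i, 4.5w_i and −1.5w_i.

```python
import numpy as np

# Assumed D2Q9 lattice: rest velocity, 4 axis neighbors, 4 diagonal neighbors.
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4 / 9] + [1 / 9] * 4 + [1 / 36] * 4)  # standard D2Q9 weights

def macroscopic(f):
    """Density and velocity from the DF: rho = sum_i f_i, rho*u = sum_i e_i f_i."""
    rho = f.sum(axis=0)
    u = np.einsum("id,ixy->dxy", E, f) / rho
    return rho, u

def equilibrium(rho, u):
    """Equilibrium DF of the form rho*(A + B(e.u) + C(e.u)^2 + D u^2)."""
    eu = np.einsum("id,dxy->ixy", E, u)
    usq = (u ** 2).sum(axis=0)
    return W[:, None, None] * rho * (1 + 3 * eu + 4.5 * eu ** 2 - 1.5 * usq)

def collide(f, tau):
    """BGK collision: purely local arithmetic at every lattice site."""
    rho, u = macroscopic(f)
    return f - (f - equilibrium(rho, u)) / tau

def stream(f):
    """Streaming: copy each f_i to the neighbor at x + e_i (periodic wrap)."""
    return np.stack([
        np.roll(f[i], shift=(int(E[i, 0]), int(E[i, 1])), axis=(0, 1))
        for i in range(len(E))
    ])

def viscosity(tau):
    """Kinematic viscosity in lattice units: nu = (tau - 1/2) / 3."""
    return (tau - 0.5) / 3.0

# One time step on a small periodic grid with a uniform initial flow.
rho0 = np.ones((8, 8))
u0 = np.zeros((2, 8, 8))
u0[0] = 0.05
f = equilibrium(rho0, u0)
f = stream(collide(f, tau=0.8))
```

Note how `collide` contains all of the floating-point work while `stream` is nothing but array shifts, which mirrors the observation above and is what makes the method map well onto parallel hardware.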

Due to its particulate nature and local dynamics, LBM has several advantages over other conventional CFD methods, especially in dealing with complex boundaries, incorporating microscopic interactions, and parallelizing the algorithm [37]. Even though the LBM is based on a particle picture, its principal focus is the averaged macroscopic behavior. The kinetic equation provides many of the advantages of molecular dynamics, including clear physical pictures, easy implementation of boundary conditions, and fully parallel algorithms. Because of the availability of very fast and massively parallel machines, there is a current trend to use codes that can exploit the intrinsic features of parallelism. The LBM fulfills these requirements in a straightforward manner. Benchmarks show that LBM on parallel machines with a fast network connection achieves close to linear speedup as more processing units are added [38].

6 CFD based on GPU Programming And Its Problem

In recent years, the GPU has attracted attention for numerical computing. Its structure is highly parallelized and optimized to achieve high performance for image processing. Thanks to hardware acceleration, a GPU can perform floating-point calculations very quickly, especially those that can be expressed as shading operations. Moreover, GPU speed has improved dramatically over the past five years, at a much faster rate than CPU speed [39].

CFD calculations on a GPU first appeared in a GPU programming book from NVIDIA. The author


SPH

Smoothed-particle hydrodynamics

Real-Time Particle-Based Simulation on GPUs (sap 0151)

Takahiro Harada*, Masayuki Tanaka†, Seiichi Koshizuka‡, Yoichiro Kawaguchi§

The University of Tokyo

*e-mail: [email protected]; †e-mail: [email protected]; ‡e-mail: [email protected]; §e-mail: [email protected]

Figure 1: Real-time simulation of glasses and liquid. A glass tower is filled with liquid and a glass is thrown into the scene. This simulation runs at 17.1 frames per second on a GeForce 8800GTX.

1 Introduction

As physical laws govern the motion of objects around us, a physically-based simulation plays an important role in computer graphics. For instance, the motion of a fluid, which is difficult to generate by hand, can be produced by solving the governing equations. Acceleration of a simulation is one of the most important research themes because the speed and stability of a simulation are essential for real-time applications.

The current trend in processor technology is to improve the efficiency of processors and not increase their frequency. Processors nowadays are equipped with parallel architecture. Cell Broadband Engine Architecture is a multi-core processor for general-purpose computation and Graphics Processing Units (GPUs) are specialized parallel processors for graphics tasks. Additionally, CPUs are also shifting to multi-core design. What we need to do now is adapt to these platforms. Therefore, we need to develop data-parallel algorithms that exploit their computational power.

In this sketch, we show that a particle-based simulation can be parallelized and implemented entirely on Graphics Processing Units (GPUs) as a parallel computation platform. As a result, we can obtain unprecedented performance with scalar processors. We also present a particle-based method for the interaction of fluids and rigid bodies. In this method, rigid bodies are represented by a set of particles. The benefits of this method are its low computational cost and the parallelism of its algorithm.

2 Methods

Smoothed Particle Hydrodynamics (SPH) is employed to solve the governing equation of a fluid [Muller et al. 2003]. A characteristic of particle methods, including SPH, is that there is no numerical dissipation caused by the advection calculation, so mass loss does not occur even if the resolution of a simulation is low. Therefore, particle methods are suited for real-time applications. As for the rigid body simulation, a rigid body is represented by a set of particles (spheres), as the title of this sketch implies. We call them rigid particles. All rigid particles have the same size, which is also the size of the fluid particles. An advantage of this shape representation is that the computation speed is controllable by changing the accuracy, i.e., the resolution of particles. With this shape representation, not only collision between rigid bodies but also interaction between a rigid body and a fluid can be converted to the simple problem of computing particle interactions. Thus, the computation is simple and can be performed in parallel. However, the shape representation using particles increases the number of simulation entities because a rigid body consists of a few rigid particles. A uniform grid is introduced to make the neighboring particle search efficient. The interaction between fluid particles and rigid particles is calculated by treating rigid bodies as a fluid. The density is also computed on rigid particles and then the pressure and viscosity forces are calculated between fluid particles. The force on a rigid particle, which is the sum of the force from the fluid and the force from collisions between rigid particles, is used to update the linear and angular momenta of a rigid body.

As described above, the force computation between rigid bodies and a fluid can be executed in parallel. When GPUs are used, a frame buffer is rendered with a fragment shader by assigning a pixel to a particle. The fragment shader computes the force from the physical values of neighboring particles, which are read from other textures. Data that stores information about neighboring particles is generated in advance by a vertex shader. In this way, forces on rigid bodies and a fluid are computed in parallel.

3 Results

In Figure 1, 10 glasses are stacked and a fluid is poured from above them. Then, a glass is thrown onto them and the stacked glasses collapse. This simulation uses 49,153 particles and runs at 17.1 frames per second on a GeForce 8800GTX using a rendering in which point sprites are used to render particles. The simulator outputs simulation data and the surface of the fluid is constructed by Marching Cubes by assigning densities to fluid particles. The polygons are rendered after the simulation with an offline renderer. The accompanying video includes several examples which run in real time. The simulation with the largest particle number runs at 3.85 frames per second with 245,760 particles. These examples show the capability of the present technique.

References

MULLER, M., CHARYPAR, D., AND GROSS, M. 2003. Particle-based fluid simulation for interactive applications. In Proc. of SIGGRAPH Symposium on Computer Animation, 154–159.

Harada et al. use modern GPUs (GeForce 8800 GTX) to simulate 49,153 particles at 17 fps.


8. Sun, Huawei, Zhao, Lingying, Zhang, Yuanhui. Evaluating RNG k-epsilon models using PIV data for airflow in animal buildings at different ventilation rates. ASHRAE Transactions, January 1, 2007.

9. Tominaga, Yoshihide, Mochida, Akashi, et al. Comparison of performance of various revised k-epsilon models applied to CFD analysis of flowfield around a high-rise building. Journal of Architecture, Planning and Environmental Engineering.

10. P. Neofytou, A.G. Venetsanos, et al. CFD simulations of the wind environment around an airport terminal building. Environmental Modelling & Software, Volume 21, Issue 4, April 2006, Pages 520-524.

11. R. Panneer Selvam. Computation of flow around Texas Tech building using k-epsilon and Kato-Launder k-epsilon turbulence model. Engineering Structures, Volume 18, Issue 11, November 1996, Pages 856-860.

12. Stam J. Stable fluids. In: Proceedings of the 26th international conference on computer graphics and interactive techniques, SIGGRAPH '99, Los Angeles; 1999.

13. Harris MJ. Real-time cloud simulation and rendering. Ph.D. thesis, University of North Carolina at Chapel Hill; 2003.

14. Song O-Y, Shin H, Ko H-S. Stable but nondissipative water. ACM Transactions on Graphics 2005;24(1):81-97.

15. Zuo W, Chen Q. Validation of fast fluid dynamics for room airflow. In: Proceedings of the 10th international IBPSA conference, Building Simulation 2007, Beijing, China; 2007.

Stable Fluids

Jos Stam

Alias wavefront, 1218 Third Ave, 8th Floor, Seattle, WA 98101, [email protected]

Abstract

Building animation tools for fluid-like motions is an important and challenging problem with many applications in computer graphics. The use of physics-based models for fluid flow can greatly assist in creating such tools. Physical models, unlike key frame or procedural based techniques, permit an animator to almost effortlessly create interesting, swirling fluid-like behaviors. Also, the interaction of flows with objects and virtual forces is handled elegantly. Until recently, it was believed that physical fluid models were too expensive to allow real-time interaction. This was largely due to the fact that previous models used unstable schemes to solve the physical equations governing a fluid. In this paper, for the first time, we propose an unconditionally stable model which still produces complex fluid-like flows. As well, our method is very easy to implement. The stability of our model allows us to take larger time steps and therefore achieve faster simulations. We have used our model in conjunction with advecting solid textures to create many fluid-like animations interactively in two- and three-dimensions.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - Animation

Keywords: animation of fluids, Navier-Stokes, stable solvers, implicit elliptic PDE solvers, interactive modeling, gaseous phenomena, advected textures

1 Introduction

One of the most intriguing problems in computer graphics is the simulation of fluid-like behavior. A good fluid solver is of great importance in many different areas. In the special effects industry there is a high demand to convincingly mimic the appearance and behavior of fluids such as smoke, water and fire. Paint programs can also benefit from fluid solvers to emulate traditional techniques such as watercolor and oil paint. Texture synthesis is another possible application. Indeed, many textures result from fluid-like processes, such as erosion. The modeling and simulation of fluids is, of course, also of prime importance in most scientific disciplines and in engineering. Fluid mechanics is used as the standard mathematical framework on which these simulations are based. There is a consensus among scientists that the Navier-Stokes equations are a very good model for fluid flow. Thousands of books and articles have been published in various areas on how to compute these equations numerically. Which solver to use in practice depends largely on the problem at hand and on the computing power available. Most engineering tasks require that the simulation provide accurate bounds on the physical quantities involved to answer questions related to safety, performance, etc. The visual appearance (shape) of the flow is of secondary importance in these applications. In computer graphics, on the other hand, the shape and the behavior of the fluid are of primary interest, while physical accuracy is secondary or in some cases irrelevant. Fluid solvers, for computer graphics, should ideally provide a user with a tool that enables her to achieve fluid-like effects in real-time. These factors are more important than strict physical accuracy, which would require too much computational power.

In fact, most previous models in computer graphics were driven by visual appearance and not by physical accuracy. Early flow models were built from simple primitives. Various combinations of these primitives allowed the animation of particle systems [15, 17] or simple geometries such as leaves [23]. The complexity of the flows was greatly improved with the introduction of random turbulences [16, 20]. These turbulences are mass conserving and, therefore, automatically exhibit rotational motion. Also the turbulence is periodic in space and time, which is ideal for motion "texture mapping" [19]. Flows built up from a superposition of flow primitives all have the disadvantage that they do not respond dynamically to user-applied external forces. Dynamical models of fluids based on the Navier-Stokes equations were first implemented in two-dimensions. Both Yaeger and Upson and Gamito et al. used a vortex method coupled with a Poisson solver to create two-dimensional animations of fluids [24, 8]. Later, Chen et al. animated water surfaces from the pressure term given by a two-dimensional simulation of the Navier-Stokes equations [2]. Their method unlike ours is both limited to two-dimensions and is unstable. Kass and Miller linearize the shallow water equations to simulate liquids [12]. The simplifications do not, however, capture the interesting rotational motions characteristic of fluids. More recently, Foster and Metaxas clearly show the advantages of using the full three-dimensional Navier-Stokes equations in creating fluid-like animations [7]. Many effects which are hard to key frame manually such as swirling motion and flows past objects are obtained automatically. Their algorithm is based mainly on the work of Harlow and Welch in computational fluid dynamics, which dates back to 1965 [11]. Since then many other techniques which Foster and Metaxas could have used have been developed. However, their model has the advantage of being simple to code, since it is based on a finite differencing of the Navier-Stokes equations and an explicit time solver. Similar solvers and their source code are also available from the book of Griebel et al. [9]. The main problem with explicit solvers is that the numerical scheme can become unstable for large time-steps. Instability leads to numerical simulations that "blow-up" and therefore have to be restarted with a smaller time-step. The instability of these explicit algorithms sets serious limits on speed and interactivity. Ideally, a user should be able to interact in real-time with a fluid solver without having to worry about possible "blow ups".

In this paper, for the first time, we propose a stable algorithm that solves the full Navier-Stokes equations. Our algorithm is very


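the most popular ways is Fast Fluid Dynamics (FFD), which breaks the Navier-Stokes equations into several sub-equations and solves them one by one. The FFD scheme was originally proposed for computer visualization and computer games [12], [13], [14].

The FFD calculation uses the Helmholtz-Hodge decomposition, which states that any vector field w can be uniquely decomposed into the form:

$$\mathbf{w} = \mathbf{u} + \nabla q \tag{5}$$

where u has zero divergence and q is a scalar field. Define an operator P which projects any vector field w onto its divergence-free part u = Pw. Because u has zero divergence, taking the divergence of both sides of the above equation gives ∇ · w = ∇²q. Then u can be written as

$$\mathbf{u} = \mathbf{w} - \nabla q \tag{6}$$

Applying P to both sides of the Navier-Stokes momentum equation, and using Pu = u and P∇p = 0, gives

$$\frac{\partial \mathbf{u}}{\partial t} = \mathbf{P}\left( -(\mathbf{u} \cdot \nabla)\mathbf{u} + \nu \nabla^2 \mathbf{u} + \mathbf{f} \right) \tag{7}$$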
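The projection operator P can be illustrated with a short NumPy sketch. It makes several assumptions that are not in the slides: the domain is a square, periodic, two-dimensional grid, the Poisson equation ∇ · w = ∇²q is solved with FFTs, and the function name `project` and its parameters are hypothetical. Room-airflow solvers with walls would need a different Poisson solver and boundary treatment.

```python
import numpy as np

def project(wx, wy, length=2.0 * np.pi):
    """Split w = u + grad(q) on a periodic square grid; return (ux, uy, q)."""
    n = wx.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)   # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx ** 2 + ky ** 2
    k2[0, 0] = 1.0                     # avoid 0/0; the mean of q is arbitrary

    wx_h, wy_h = np.fft.fft2(wx), np.fft.fft2(wy)
    div_h = 1j * (kx * wx_h + ky * wy_h)   # Fourier transform of div(w)
    q_h = -div_h / k2                      # solves laplacian(q) = div(w)
    q_h[0, 0] = 0.0                        # fix the mean of q to zero

    ux = np.fft.ifft2(wx_h - 1j * kx * q_h).real   # u = w - grad(q)
    uy = np.fft.ifft2(wy_h - 1j * ky * q_h).real
    return ux, uy, np.fft.ifft2(q_h).real

# Example: a divergence-free field plus the gradient of q = cos(x) cos(y).
n = 32
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
wx = np.sin(Y) - np.sin(X) * np.cos(Y)
wy = np.sin(X) - np.cos(X) * np.sin(Y)
ux, uy, q = project(wx, wy)
```

In this example the projection should recover the divergence-free part (sin y, sin x) and the scalar field cos(x) cos(y), since the input was built as their sum.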

Then FFD tries to solve the equation into the following for steps:

add force, advert, diffuse and project for the time step δ�.Let u0 = u(x� 0) representing the initial state. Start from the

solution w0(�) = u(x� �) of the previous time step. The first step is

to add the additional of the external force f. Assuming the force

does not vary considerably during the time step, we have:

w1(x) = w0(x) + δ�f(x� �) (8)

The next step accounts for the effect of advection (or convection) of the fluid on itself. A disturbance somewhere in the fluid propagates according to the expression −(u·∇)u. This term makes the Navier-Stokes equations non-linear. FFD uses a simple treatment to make it linear. At each time step the velocity of the fluid itself moves all the fluid particles. Therefore, to obtain the velocity at a point x at the new time t + δt, FFD back-traces the point x through the velocity field w1 over a time δt. This defines a path p(x, s) corresponding to a partial streamline of the velocity field. The new velocity at the point x is then set to the velocity that the particle, now at x, had at its previous location a time δt ago:

$$ \mathbf{w}_2(\mathbf{x}) = \mathbf{w}_1(\mathbf{p}(\mathbf{x}, -\delta t)) \tag{9} $$

The third step solves for the effect of viscosity and is equivalent to a diffusion equation:

$$ \frac{\partial \mathbf{w}_2}{\partial t} = \nu\nabla^2\mathbf{w}_2 \tag{10} $$

The equation can be solved using an implicit method:

$$ (\mathbf{I} - \nu\,\delta t\,\nabla^2)\,\mathbf{w}_3(\mathbf{x}) = \mathbf{w}_2(\mathbf{x}) \tag{11} $$
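The practical benefit of the implicit form in Equation 11 can be seen in a small one-dimensional experiment. The sketch below is illustrative only: the grid size, the value a = ν δt/δx² = 2, the fixed end values, and the function names are all assumptions, and the implicit system is solved with plain Jacobi iteration rather than any solver named in the source.

```python
def explicit_step(w, a):
    """One explicit Euler step of dw/dt = nu * d2w/dx2, with a = nu*dt/dx**2."""
    n = len(w)
    return [w[i] if i in (0, n - 1)
            else w[i] + a * (w[i - 1] - 2.0 * w[i] + w[i + 1])
            for i in range(n)]


def implicit_step(w, a, iterations=200):
    """Solve (I - a * Laplacian) w_new = w, as in Equation 11, by Jacobi iteration."""
    n = len(w)
    x = list(w)
    for _ in range(iterations):
        # Each unknown is rebuilt from the previous iterate only; end values stay fixed.
        x = [w[i] if i in (0, n - 1)
             else (w[i] + a * (x[i - 1] + x[i + 1])) / (1.0 + 2.0 * a)
             for i in range(n)]
    return x


def run(step, steps=10, a=2.0):
    """Diffuse a unit spike for a few large time steps."""
    w = [0.0] * 9
    w[4] = 1.0
    for _ in range(steps):
        w = step(w, a)
    return w


explicit = run(explicit_step)
implicit = run(implicit_step)
print("explicit max |w|:", max(abs(v) for v in explicit))
print("implicit max |w|:", max(abs(v) for v in implicit))
```

With this large time step the explicit update oscillates with growing amplitude, while the implicit update keeps every value between 0 and 1.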

The fourth step is the projection, which makes the resulting field divergence free:

$$ \nabla^2 q = \nabla\cdot\mathbf{w}_3, \qquad \mathbf{w}_4 = \mathbf{w}_3 - \nabla q \tag{12} $$

Some recent papers apply FFD to building simulation. In 2007, Qingyan Chen published a proceedings paper at the Building Simulation Conference 2007 describing the initial work to validate FFD for room airflow [15]. They followed it with a more comprehensive paper in Indoor Air in 2009 [16]. The results showed that the FFD is about 50 times faster than the CFD. The FFD could correctly predict laminar flow, such as the flow in a lid-driven cavity at Re = 100. But the FFD has some problems in computing turbulent flows due to the lack of turbulence treatments. Although the FFD can capture the major pattern of the flow, it cannot compute the flow profile as accurately as the CFD does. They

Figure 4: Snapshots from our interactive fluid solver.


Chapter 38

Fast Fluid Dynamics Simulation on the GPU
Mark J. Harris
University of North Carolina at Chapel Hill


This chapter describes a method for fast, stable fluid simulation that runs entirely on the GPU. It introduces fluid dynamics and the associated mathematics, and it describes in detail the techniques to perform the simulation on the GPU. After reading this chapter, you should have a basic understanding of fluid dynamics and know how to simulate fluids using the GPU. The source code accompanying this book demonstrates the techniques described in this chapter.

38.1 Introduction

Fluids are everywhere: water passing between riverbanks, smoke curling from a glowing cigarette, steam rushing from a teapot, water vapor forming into clouds, and paint being mixed in a can. Underlying all of them is the flow of fluids. All are phenomena that we would like to portray realistically in interactive graphics applications. Figure 38-1 shows examples of fluids simulated using the source code provided with this book.

Fluid simulation is a useful building block that is the basis for simulating a variety of natural phenomena. Because of the large amount of parallelism in graphics hardware, the simulation we describe runs significantly faster on the GPU than on the CPU. Using an NVIDIA GeForce FX, we have achieved a speedup of up to six times over an equivalent CPU simulation.


apply Equation 16 at every grid cell, using the results of the previous iteration as input to the next (x^(k+1) becomes x^(k)). Because Jacobi iteration converges slowly, we need to execute many iterations. Fortunately, Jacobi iterations are cheap to execute on the GPU, so we can run many iterations in a very short time.

Initial and Boundary Conditions

Any differential equation problem defined on a finite domain requires boundary conditions in order to be well posed. The boundary conditions determine how we compute values at the edges of the simulation domain. Also, to compute the evolution of the flow over time, we must know how it started, that is, its initial conditions. For our fluid simulation, we assume the fluid initially has zero velocity and zero pressure everywhere. Boundary conditions require a bit more discussion.

During each time step, we solve equations for two quantities, velocity and pressure, and we need boundary conditions for both. Because our fluid is simulated on a rectangular grid, we assume that it is a fluid in a box and cannot flow through the sides of the box. For velocity, we use the no-slip condition, which specifies that velocity goes to zero at the boundaries. The correct solution of the Poisson-pressure equation requires pure Neumann boundary conditions: ∂p/∂n = 0. This means that at a boundary, the rate of change of pressure in the direction normal to the boundary is zero. We revisit boundary conditions at the end of Section 38.3.

38.3 Implementation

Now that we understand the problem and the basics of solving it, we can move forward with the implementation. A good place to start is to lay out some pseudocode for the algorithm. The algorithm is the same every time step, so this pseudocode represents a single time step. The variables u and p hold the velocity and pressure field data.

// Apply the first 3 operators in Equation 12.
u = advect(u);
u = diffuse(u);
u = addForces(u);
// Now apply the projection operator to the result.
p = computePressure(u);
u = subtractPressureGradient(u, p);


In practice, temporary storage is needed, because most of these operations cannot be performed in place. For example, the advection step in the pseudocode is more accurately written as:

uTemp = advect(u);
swap(u, uTemp);

This pseudocode contains no implementation-specific details. In fact, the same pseudocode describes CPU and GPU implementations equally well. Our goal is to perform all the steps on the GPU. Computation of this sort on the GPU may be unfamiliar to some readers, so we will draw some analogies between operations in a typical CPU fluid simulation and their counterparts on the GPU.

38.3.1 CPU–GPU Analogies

Fundamental to any computer are its memory and processing models, so any application must consider data representation and computation. Let’s touch on the differences between CPUs and GPUs with regard to both of these.

Textures = Arrays

Our simulation represents data on a two-dimensional grid. The natural representation for this grid on the CPU is an array. The analog of an array on the GPU is a texture. Although textures are not as flexible as arrays, their flexibility is improving as graphics hardware evolves. Textures on current GPUs support all the basic operations necessary to implement a fluid simulation. Because textures usually have three or four color channels, they provide a natural data structure for vector data types with two to four components. Alternatively, multiple scalar fields can be stored in a single texture. The most basic operation is an array (or memory) read, which is accomplished by using a texture lookup. Thus, the GPU analog of an array offset is a texture coordinate. We need at least two textures to represent the state of the fluid: one for velocity and one for pressure. In order to visualize the flow, we maintain an additional texture that contains a quantity carried by the fluid. We can think of this as “ink.” Figure 38-4 shows examples of these textures, as well as an additional texture for vorticity, described in Section 38.5.1.

Loop Bodies = Fragment Programs

A CPU implementation of the simulation performs the steps in the algorithm by looping, using a pair of nested loops to iterate over each cell in the grid. At each cell, the same computation is performed. GPUs do not have the capability to perform this inner loop over each texel in a texture. However, the fragment pipeline is designed to perform identical computations at each fragment. To the programmer, it appears as if there is a processor for each fragment, and that all fragments are updated simultaneously. In the parlance of parallel programming, this model is known as single instruction, multiple data (SIMD) computation. Thus, the GPU analog of computation inside nested loops over an array is a fragment program applied in SIMD fashion to each fragment.

Feedback = Texture Update

In Section 38.2.4, we described how we use Jacobi iteration to solve Poisson equations. This type of iterative method uses the result of an iteration as input for the next iteration. This feedback is common in numerical methods. In a CPU implementation, one typically does not even consider feedback, because it is trivially implemented using variables and arrays that can be both read and written. On the GPU, though, the output of fragment processors is always written to the frame buffer. Think of the frame buffer as a two-dimensional array that cannot be directly read. There are two ways to get the contents of the frame buffer into a texture that can be read:

• Copy to texture (CTT) copies from the frame buffer to a texture.
• Render to texture (RTT) uses a texture as the frame buffer so the GPU can write directly to it.

CTT and RTT function equally well, but have a performance trade-off. For the sake of generality we do not assume the use of either and refer to the process of writing to a texture as a texture update.
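The analogies above (array = texture, loop body = fragment program, feedback = texture update) can be mimicked on the CPU. The following is only a sketch: `texture_update`, `blur`, and `in_place_blur` are hypothetical names, and the averaging kernel is an arbitrary stand-in for a real simulation step. It shows why the update writes to a second buffer that is then swapped with the first.

```python
def texture_update(kernel, src):
    """Run `kernel` once per cell ("fragment") and write into a fresh grid."""
    height, width = len(src), len(src[0])
    return [[kernel(src, x, y) for x in range(width)] for y in range(height)]


def blur(src, x, y):
    """Example kernel: average of the left and right neighbours (clamped)."""
    width = len(src[0])
    left = src[y][max(x - 1, 0)]
    right = src[y][min(x + 1, width - 1)]
    return 0.5 * (left + right)


def in_place_blur(grid):
    """Same stencil written in place: later cells read already-updated values."""
    for y in range(len(grid)):
        for x in range(len(grid[0])):
            grid[y][x] = blur(grid, x, y)
    return grid


front = [[0.0, 0.0, 4.0, 0.0, 0.0]]
back = texture_update(blur, front)   # write the result to the second buffer
front, back = back, front            # swap roles for the next pass
wrong = in_place_blur([[0.0, 0.0, 4.0, 0.0, 0.0]])
print(front)
print(wrong)
```

The two-buffer result is symmetric about the spike, whereas the in-place version smears values to the right because each cell sees neighbours that were already overwritten.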

Earlier we mentioned that, in practice, each of the five steps in the algorithm updates a temporary grid and then performs a swap. RTT requires the use of two textures to implement feedback, because the results of rendering to a texture while it is bound for

Figure 38-4. The State Fields of a Fluid Simulation, Stored in Textures
From left to right, the fields are “ink,” velocity (scaled and biased into the range [0, 1], so zero velocity is gray), pressure (blue represents low pressure, red represents high pressure), and vorticity (yellow represents counter-clockwise rotation, blue represents clockwise rotation).


reciprocal of the grid scale δx. The texture wrap mode must be set to CLAMP_TO_EDGE so that back-tracing outside the range [0, N] will be clamped to the boundary texels. The boundary conditions described later correctly update these texels so that this situation operates correctly.

Listing 38-1. Advection Fragment Program

void advect(float2 coords : WPOS,      // grid coordinates
            out float4 xNew : COLOR,   // advected qty
            uniform float timestep,
            uniform float rdx,         // 1 / grid scale
            uniform samplerRECT u,     // input velocity
            uniform samplerRECT x)     // qty to advect
{
  // follow the velocity field "back in time"
  float2 pos = coords - timestep * rdx * f2texRECT(u, coords);

  // interpolate and write to the output fragment
  xNew = f4texRECTbilerp(x, pos);
}


Figure 38-5. Primitives Used to Update the Interior and Boundaries of the Grid
Updating a grid involves rendering a quad for the interior and lines for the boundaries. Separate fragment programs are applied to interior and border fragments.

In this code, the parameter u is the velocity field texture, and x is the field that is to be advected. This could be the velocity or another quantity, such as dye concentration. The function f4texRECTbilerp() is a utility to perform bilinear interpolation of the four texels closest to the texture coordinates passed to it. Because current GPUs do not support automatic bilinear interpolation in floating-point textures, we must implement it with this type of code.
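For readers without a Cg toolchain, the back-trace and the bilinear lookup can be sketched on the CPU. This is not the book's utility code: `bilerp` and `advect` below are hypothetical Python stand-ins, they index cells by integer position rather than by pixel-centre texture coordinates, and the clamping imitates the CLAMP_TO_EDGE wrap mode.

```python
import math


def bilerp(field, x, y):
    """Bilinearly interpolate field[row][col] at (x, y), clamped to the grid."""
    height, width = len(field), len(field[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    bottom = (1 - tx) * field[y0][x0] + tx * field[y0][x1]
    top = (1 - tx) * field[y1][x0] + tx * field[y1][x1]
    return (1 - ty) * bottom + ty * top


def advect(quantity, velocity, timestep, rdx):
    """Back-trace each cell through `velocity` and sample `quantity` there."""
    height, width = len(quantity), len(quantity[0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            vx, vy = velocity[y][x]
            # follow the velocity field "back in time"
            result[y][x] = bilerp(quantity,
                                  x - timestep * rdx * vx,
                                  y - timestep * rdx * vy)
    return result


ink = [[0.0, 1.0, 2.0, 3.0]]
wind = [[(1.0, 0.0)] * 4]            # uniform flow to the right
moved = advect(ink, wind, timestep=0.5, rdx=1.0)
print(moved)
```

Each cell picks up the value found half a cell upstream, so the ramp shifts to the right, and the leftmost cell is clamped to the boundary value.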

Viscous Diffusion

With the description of the Jacobi iteration technique given in Section 38.2.4, writing a Jacobi iteration fragment program is simple, as shown in Listing 38-2.

Listing 38-2. The Jacobi Iteration Fragment Program Used to Solve Poisson Equations

void jacobi(half2 coords : WPOS,     // grid coordinates
            out half4 xNew : COLOR,  // result
            uniform half alpha,
            uniform half rBeta,      // reciprocal beta
            uniform samplerRECT x,   // x vector (Ax = b)
            uniform samplerRECT b)   // b vector (Ax = b)
{
  // left, right, bottom, and top x samples
  half4 xL = h4texRECT(x, coords - half2(1, 0));
  half4 xR = h4texRECT(x, coords + half2(1, 0));
  half4 xB = h4texRECT(x, coords - half2(0, 1));
  half4 xT = h4texRECT(x, coords + half2(0, 1));

  // b sample, from center
  half4 bC = h4texRECT(b, coords);

  // evaluate Jacobi iteration
  xNew = (xL + xR + xB + xT + alpha * bC) * rBeta;
}

Notice that the rBeta parameter is the reciprocal of β from Equation 16. To solve the diffusion equation, we set alpha to (δx)²/(νδt), rBeta to 1/(4 + (δx)²/(νδt)), and the x and b parameters to the velocity texture. We then run a number of iterations (usually 20 to 50, but more can be used to reduce the error).
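These parameter values can be recovered by discretizing the implicit diffusion equation. The short derivation below is a sketch that assumes the five-point Laplacian on a square grid of spacing δx, with x_{i,j} the unknown new value and b_{i,j} the current value at cell (i, j):

```latex
\begin{aligned}
x_{i,j} - \frac{\nu\,\delta t}{(\delta x)^2}
  \left(x_{i-1,j} + x_{i+1,j} + x_{i,j-1} + x_{i,j+1} - 4x_{i,j}\right) &= b_{i,j} \\
(4 + \alpha)\,x_{i,j} &= x_{i-1,j} + x_{i+1,j} + x_{i,j-1} + x_{i,j+1} + \alpha\,b_{i,j},
  \qquad \alpha = \frac{(\delta x)^2}{\nu\,\delta t} \\
x_{i,j} &= \frac{x_{i-1,j} + x_{i+1,j} + x_{i,j-1} + x_{i,j+1} + \alpha\,b_{i,j}}{\beta},
  \qquad \beta = 4 + \alpha
\end{aligned}
```

The last line has the same shape as the expression evaluated in the Jacobi fragment program, with rBeta = 1/β.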

Force Application

The simplest step in our algorithm is computing the acceleration caused by external forces. In the demonstration application found in the accompanying materials, you can


Listing 38-3. The Divergence Fragment Program

void divergence(half2 coords : WPOS,    // grid coordinates
                out half4 div : COLOR,  // divergence
                uniform half halfrdx,   // 0.5 / gridscale
                uniform samplerRECT w)  // vector field
{
  half4 wL = h4texRECT(w, coords - half2(1, 0));
  half4 wR = h4texRECT(w, coords + half2(1, 0));
  half4 wB = h4texRECT(w, coords - half2(0, 1));
  half4 wT = h4texRECT(w, coords + half2(0, 1));

  div = halfrdx * ((wR.x - wL.x) + (wT.y - wB.y));
}

pressure field texture to the parameter p in the following program, which computes the gradient of p according to the definition in Table 38-1 and subtracts it from the intermediate velocity field texture in parameter w. See Listing 38-4.

Listing 38-4. The Gradient Subtraction Fragment Program

void gradient(half2 coords : WPOS,     // grid coordinates
              out half4 uNew : COLOR,  // new velocity
              uniform half halfrdx,    // 0.5 / gridscale
              uniform samplerRECT p,   // pressure
              uniform samplerRECT w)   // velocity
{
  half pL = h1texRECT(p, coords - half2(1, 0));
  half pR = h1texRECT(p, coords + half2(1, 0));
  half pB = h1texRECT(p, coords - half2(0, 1));
  half pT = h1texRECT(p, coords + half2(0, 1));

  uNew = h4texRECT(w, coords);
  uNew.xy -= halfrdx * half2(pR - pL, pT - pB);
}
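The central differences used by the divergence and gradient programs can be checked on fields whose derivatives are known exactly. The Python sketch below mirrors the two stencils; the function names, the 4 × 4 test fields, and the grid scale of 1 are assumptions made for illustration.

```python
def divergence(w, x, y, halfrdx):
    """Central-difference divergence of a 2-D vector field w[y][x] = (wx, wy)."""
    w_left, w_right = w[y][x - 1], w[y][x + 1]
    w_bottom, w_top = w[y - 1][x], w[y + 1][x]
    return halfrdx * ((w_right[0] - w_left[0]) + (w_top[1] - w_bottom[1]))


def gradient(p, x, y, halfrdx):
    """Central-difference gradient of a scalar field p[y][x]."""
    return (halfrdx * (p[y][x + 1] - p[y][x - 1]),
            halfrdx * (p[y + 1][x] - p[y - 1][x]))


size = 4
halfrdx = 0.5  # 0.5 / grid scale, with an assumed grid scale of 1
expanding = [[(float(x), float(y)) for x in range(size)] for y in range(size)]
rotating = [[(-float(y), float(x)) for x in range(size)] for y in range(size)]
pressure = [[3.0 * x + 5.0 * y for x in range(size)] for y in range(size)]

print(divergence(expanding, 1, 1, halfrdx))  # field (x, y) has divergence 2
print(divergence(rotating, 1, 1, halfrdx))   # rigid rotation is divergence free
print(gradient(pressure, 1, 1, halfrdx))     # p = 3x + 5y has gradient (3, 5)
```

A rigid rotation is already divergence free, so a projection step would leave it unchanged, while the expanding field is the kind of component the pressure gradient removes.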

Boundary Conditions

In Section 38.2.4, we determined that our “fluid in a box” requires no-slip (zero) velocity boundary conditions and pure Neumann pressure boundary conditions. In Section 38.3.2 we learned that we can implement boundary conditions by reserving the one-pixel perimeter of our grid for storing boundary values. We update these values by drawing line primitives over the border, using a fragment program that sets the values appropriately.
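A CPU sketch of such a perimeter update is given below. It assumes the rule that follows from the two conditions: each boundary cell copies its adjacent interior cell, multiplied by a scale of −1 for velocity (so the average across the wall is zero) or +1 for pressure (so the normal derivative is zero). The helper names are hypothetical, and the corner cells are simply left untouched.

```python
def apply_boundary(field, scale):
    """Fill the one-cell perimeter of a square grid from the adjacent interior cells."""
    n = len(field)
    for k in range(1, n - 1):
        field[k][0] = scale * field[k][1]            # left column, offset (+1, 0)
        field[k][n - 1] = scale * field[k][n - 2]    # right column, offset (-1, 0)
        field[0][k] = scale * field[1][k]            # bottom row, offset (0, +1)
        field[n - 1][k] = scale * field[n - 2][k]    # top row, offset (0, -1)
    return field


def make_grid():
    """A 4 x 4 grid whose 2 x 2 interior holds arbitrary sample values."""
    return [[0.0, 0.0, 0.0, 0.0],
            [0.0, 2.0, 3.0, 0.0],
            [0.0, 4.0, 5.0, 0.0],
            [0.0, 0.0, 0.0, 0.0]]


velocity_x = apply_boundary(make_grid(), scale=-1.0)  # no-slip
pressure = apply_boundary(make_grid(), scale=1.0)     # pure Neumann
print(velocity_x[1][0], velocity_x[1][1])
print(pressure[2][3], pressure[2][2])
```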



First we should look at how our grid discretization affects the computation of boundary conditions. The no-slip condition dictates that velocity equals zero on the boundaries, and the pure Neumann pressure condition requires the normal pressure derivative to be zero at the boundaries. The boundary is defined to lie on the edge between the boundary cell and its nearest interior cell, but grid values are defined at cell centers. Therefore, we must compute boundary values such that the average of the two cells adjacent to any edge satisfies the boundary condition.

For the velocity boundary on the left side, for example, we have:

$$ \frac{\mathbf{u}_{0,j} + \mathbf{u}_{1,j}}{2} = 0, \quad \text{for } j \in [0, N], \tag{17} $$

where N is the grid resolution. In order to satisfy this equation, we must set u_{0,j} equal to −u_{1,j}. The pressure equation works out similarly. Using the forward difference approximation of the derivative, we get:

$$ \frac{p_{1,j} - p_{0,j}}{\delta x} = 0. \tag{18} $$

On solving this equation for p_{0,j}, we see that we need to set each pressure boundary value to the value just inside the boundary.

We can use a simple fragment program for both the pressure and the velocity boundaries, as shown in Listing 38-5.

Listing 38-5. The Boundary Condition Fragment Program

void boundary(half2 coords : WPOS,    // grid coordinates
              half2 offset : TEX1,    // boundary offset
              out half4 bv : COLOR,   // output value
              uniform half scale,     // scale parameter
              uniform samplerRECT x)  // state field
{
  bv = scale * h4texRECT(x, coords + offset);
}

Figure 38-6 demonstrates how this program works. The x parameter represents the texture (velocity or pressure field) from which we read interior values. The offset parameter contains the correct offset to the interior cells adjacent to the current boundary. The coords parameter contains the position in texture coordinates of the fragment being processed, so adding offset to it addresses a neighboring texel. At each



38.5.1 Vorticity Confinement

The motion of smoke, air and other low-viscosity fluids typically contains rotational flows at a variety of scales. This rotational flow is vorticity. As Fedkiw et al. explained, numerical dissipation caused by simulation on a coarse grid damps out these interesting features (Fedkiw et al. 2001). Therefore, they used vorticity confinement to restore these fine-scale motions. Vorticity confinement works by first computing the vorticity, ω = ∇ × u. From the vorticity we compute a normalized vorticity vector field:

$$ \Psi = \frac{\eta}{|\eta|}. $$

Here, η = ∇|ω|. The vectors in this vector field point from areas of lower vorticity to areas of higher vorticity. From these vectors we compute a force that can be used to restore an approximation of the dissipated vorticity:


Figure 38-7. Cloud Simulation
A sequence of frames (20 iterations apart) from a two-dimensional cloud simulation running on a GPU.

Fast and informative flow simulations in a building by using fast fluid dynamics model on graphics processing unit

Wangda Zuo, Qingyan Chen*

National Air Transportation Center of Excellence for Research in the Intermodal Transport Environment (RITE), School of Mechanical Engineering, Purdue University, 585 Purdue Mall, West Lafayette, IN 47907-2088, USA

Article info

Article history:
Received 14 April 2009
Received in revised form 17 August 2009
Accepted 19 August 2009

Keywords:
Graphics Processing Unit (GPU)
Airflow simulation
Fast Fluid Dynamics (FFD)
Parallel computing
Central Processing Unit (CPU)

Abstract

Fast indoor airflow simulations are necessary for building emergency management, preliminary design of sustainable buildings, and real-time indoor environment control. The simulation should also be informative since the airflow motion, temperature distribution, and contaminant concentration are important. Unfortunately, none of the current indoor airflow simulation techniques can satisfy both requirements at the same time. Our previous study proposed a Fast Fluid Dynamics (FFD) model for indoor flow simulation. The FFD is an intermediate method between the Computational Fluid Dynamics (CFD) and multizone/zonal models. It can efficiently solve Navier–Stokes equations and other transportation equations for energy and species at a speed of 50 times faster than the CFD. However, this speed is still not fast enough to do real-time simulation for a whole building. This paper reports our efforts on further accelerating FFD simulation by running it in parallel on a Graphics Processing Unit (GPU). This study validated the FFD on the GPU by simulating the flow in a lid-driven cavity, channel flow, forced convective flow, and natural convective flow. The results show that the FFD on the GPU can produce reasonable results for those indoor flows. In addition, the FFD on the GPU is 10–30 times faster than that on a Central Processing Unit (CPU). As a whole, the FFD on a GPU can be 500–1500 times faster than the CFD on a CPU. By applying the FFD to the GPU, it is possible to do real-time informative airflow simulation for a small building.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

According to the United States Fire Administration [1], 3430 civilians and 118 firefighters lost their lives in fires in 2007, with an additional 17,675 civilians injured. Smoke inhalation is responsible for most fire-related injuries and deaths in buildings. Computer simulations can predict the transportation of poisonous air/gas in buildings. If the prediction is in real-time or faster-than-real-time, firefighters can follow appropriate rescue plans to minimize casualties. In addition, to design sustainable buildings that can provide a comfortable and healthy indoor environment with less energy consumption, it is essential to know the distributions of air velocity, air temperature, and contaminant concentration in buildings. Flow simulations in buildings can provide this information [2]. Again, the predictions should be rapid due to the limited time available during the design process. Furthermore, one can optimize building HVAC control systems if the indoor environment can be simulated in real-time or faster-than-real-time.

However, none of the current flow simulation techniques for buildings can satisfy the requirements for obtaining results quickly and informatively. For example, CFD is an important tool in studying flow and contaminant transport in buildings [3]. But when the simulated flow domain is large or the flow is complex, the CFD simulation requires a large amount of computing meshes. Consequently, it needs a very long computing time if it is only using a single processor computer [4].

A typical approach to reduce the computing time for indoor airflow simulations is to reduce the order of flow simulation models. Zonal models [5] divide a room into several zones and assume that air property in a zone is uniform. Based on this assumption, zonal models only compute a few nodes for a room to greatly reduce related computing demands. Multizone models [6] expand the uniform assumption to the whole room so that the number of computing nodes can be further reduced. These approaches are widely used for air simulations in a whole building. However, the zonal and multizone models solve only the mass continuity, energy, and species concentration equations but not the momentum equations. They are fast but not accurate enough since they can only provide the bulk information of each zone without the details about the airflow and contaminant transport inside the zone [6].

* Corresponding author. Tel.: +1 765 496 7562; fax: +1 765 494 0539.
E-mail addresses: [email protected] (W. Zuo), [email protected] (Q. Chen).

Contents lists available at ScienceDirect

Building and Environment

journal homepage: www.elsevier.com/locate/buildenv

0360-1323/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.buildenv.2009.08.008

Building and Environment 45 (2010) 747–757

Recently, an FFD method [7] has been proposed for fast flow simulations in buildings as an intermediate method between the CFD and zonal/multizone models. The FFD method solves the continuity equation and unsteady Navier–Stokes equations as the CFD does. By using a different numerical scheme to solve the governing equations, the FFD can run about 50 times faster than the CFD with the same numerical setting on a single CPU [8]. Although the FFD is not as accurate as the CFD, it can provide more detailed information than a multizone model or a zonal model.

Although the FFD is much faster than the CFD, its speed is still not fast enough for real-time flow simulation in a building. For example, our previous work [8] found that the FFD simulation can be real-time with 65,000 grids. If a simulation domain with 30 × 30 × 30 grids is applied for a room, the FFD code can only simulate the airflow in 2–3 rooms in real time. Hence, if we want to do real-time simulation for a large building, we have to further accelerate the FFD simulation.
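The room estimate follows from simple arithmetic on the two figures quoted above:

```latex
30 \times 30 \times 30 = 27{,}000 \ \text{grids per room}, \qquad
\frac{65{,}000 \ \text{grids}}{27{,}000 \ \text{grids per room}} \approx 2.4 \ \text{rooms}
```

which is consistent with the stated range of 2–3 rooms.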

To reduce the computing time, many researchers have performed the flow simulations in parallel on multi-processor computers [9,10]. It is also possible to speed up the FFD simulation by running it in parallel on a multi-processor computer. However, this approach needs large investments in equipment purchase and installation and a designated space for installing the computers and the related capacity of the cooling system used in the space. In addition, the fees for the operation and maintenance of a multi-processor computer are also nearly the same as those of several single processor computers of the same capacity. Hence, multi-processor computers are a luxury for building designers or emergency management teams.

Recently, the GPU has attracted attention for parallel computing. Different from a CPU, the GPU is the core of a computer graphics card and integrates multiple processors on a single chip. Its structure is highly parallelized to achieve high performance for image processing. For example, an NVIDIA GeForce 8800 GTX GPU, available since 2006, integrates 128 processors so that its peak computing speed is 367 GFLOPS (Giga FLoating point Operations Per Second). Comparatively, the peak performance of an Intel Core 2 Duo 3.0 GHz CPU available at the same time is only about 32 GFLOPS [11]. Fig. 1 compares the computing speeds of CPU and GPU. The speed gap between the CPU and the GPU has been increasing since 2003. Furthermore, this trend is likely to continue in the future. Besides the GPU’s high performance, the cost of a GPU is low. For example, a graphics card with an NVIDIA GeForce 8800 GTX GPU costs only around $500. It can easily be installed onto a personal computer and there are no other additional costs.

Thus, it seems possible to realize fast and informative indoor airflow simulations by using the FFD on a GPU. This paper reports our efforts to implement the FFD model in parallel on an NVIDIA GeForce 8800 GTX GPU. The GPU code was then validated by simulating several flows that consist of the basic features of indoor airflows.

2. Fast fluid dynamics

Our investigation used the FFD scheme proposed by Stam [7]. The FFD applies a splitting method to solve the continuity equation (1) and Navier–Stokes equation (2) for an unsteady incompressible flow:

$$ \frac{\partial U_i}{\partial x_i} = 0, \tag{1} $$

$$ \frac{\partial U_i}{\partial t} = -U_j\frac{\partial U_i}{\partial x_j} + \nu\frac{\partial^2 U_i}{\partial x_j^2} - \frac{1}{\rho}\frac{\partial P}{\partial x_i} + \frac{f_i}{\rho}, \tag{2} $$

where U_i and U_j are fluid velocity components in x_i and x_j directions, respectively; ν is kinematic viscosity; ρ is fluid density; P is pressure; t is time; and f_i are body forces, such as buoyancy force and

Fig. 1. Comparison of the computing speeds of GPU (NVIDIA) and CPU (Intel) since 2003 [11]. [Plot: computing speed in GFLOPS (0 to 400) versus year (2003 to 2007), one curve each for GPU and CPU.]

Nomenclature

a_{i,j}, b_{i,j}: equation coefficient (dimensionless)
C: contaminant concentration (kg/m³)
f_i: body force (kg/m² s²)
H: the width of the room (m)
i, j: mesh node indices
k_C: contaminant diffusivity (m²/s)
k_T: thermal diffusivity (m²/s)
L: length scale (m)
P: pressure (kg/m s²)
S_C: contaminant source (kg/m³ s)
S_T: heat source (°C/s)
T: temperature (°C)
t: time (s)
u_{i,j}: velocity components at mesh node (i, j) (m/s)
U_b: bulk velocity (m/s)
U_i, U_j: velocity components in x_i and x_j directions, respectively (m/s)
U: horizontal velocity or velocity scale (m/s)
V: vertical velocity (m/s)
x_i, x_j: spatial coordinates
x, y: spatial coordinates
Δt: time step (s)
ν: kinematic viscosity (m²/s)
superscript 0: previous time step


other external forces. The FFD splits the Navier–Stokes equation (2) into three simple equations (3)–(5). Then it solves them one by one.

$$ \frac{\partial U_i}{\partial t} = -U_j\frac{\partial U_i}{\partial x_j}, \tag{3} $$

$$ \frac{\partial U_i}{\partial t} = \nu\frac{\partial^2 U_i}{\partial x_j^2} + \frac{f_i}{\rho}, \tag{4} $$

$$ \frac{\partial U_i}{\partial t} = -\frac{1}{\rho}\frac{\partial P}{\partial x_i}, \tag{5} $$

Equation (3) can be reformatted as

$$ \frac{\partial U_i}{\partial t} + U_j\frac{\partial U_i}{\partial x_j} = \frac{DU_i}{Dt} = 0, \tag{6} $$

where DU_i/Dt is the material derivative. This means that if we follow a flow particle, the flow properties, such as velocities U_i, on this particle will not change with time. Therefore, one can get the value of U_i by finding its value at the previous time step. The current study used a first order semi-Lagrangian approach [12] to calculate the value of U_i.

Equation (4) is a typical unsteady diffusion equation. One can easily solve it by using an iterative scheme such as Gauss-Seidel iteration or Jacobi iteration. This work has applied the Jacobi iteration since it can solve the equation in parallel.
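Why Jacobi iteration parallelizes and Gauss-Seidel does not can be seen from the order in which cells are visited. The sketch below is illustrative: it uses a one-dimensional implicit diffusion stencil with an assumed coefficient, so that x[i] = (x[i-1] + x[i+1] + b[i]) / 3, and all names are hypothetical.

```python
def jacobi_sweep(x, b, order):
    """One Jacobi sweep; every update reads only values from the previous sweep."""
    new = list(x)
    for i in order:
        new[i] = (x[i - 1] + x[i + 1] + b[i]) / 3.0
    return new


def gauss_seidel_sweep(x, b, order):
    """One Gauss-Seidel sweep; updates read values already changed in this sweep."""
    new = list(x)
    for i in order:
        new[i] = (new[i - 1] + new[i + 1] + b[i]) / 3.0
    return new


x = [0.0, 1.0, 2.0, 3.0, 0.0]   # end values act as fixed boundaries
b = [0.0, 1.0, 1.0, 1.0, 0.0]
forward, backward = [1, 2, 3], [3, 2, 1]

print(jacobi_sweep(x, b, forward) == jacobi_sweep(x, b, backward))
print(gauss_seidel_sweep(x, b, forward) == gauss_seidel_sweep(x, b, backward))
```

Because the Jacobi result does not depend on the visiting order, every cell can be assigned to its own GPU thread; the Gauss-Seidel result changes with the order, which forces some serialization.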

Finally, the FFD ensures mass conservation by solving equations (1) and (5) together with a pressure-correction projection method [13]. The idea of the projection method is that the pressure should be adjusted so that the velocities satisfy mass conservation. Assuming $U_i^0$ is the velocity obtained from equation (4), equation (5) can be expanded to

$$\frac{U_i - U_i^0}{\Delta t} = -\frac{1}{\rho} \frac{\partial P}{\partial x_i}, \tag{7}$$

where $\Delta t$ is the time step size and $U_i$ is the unknown velocity, which satisfies the continuity equation (1):

$$\frac{\partial U_i}{\partial x_i} = 0. \tag{8}$$

Substituting equation (7) into (8), one can get

$$\frac{\partial U_i^0}{\partial x_i} = \frac{\Delta t}{\rho} \frac{\partial^2 P}{\partial x_i^2}. \tag{9}$$

Solving equation (9), one can obtain $P$. Substituting $P$ into equation (7), $U_i$ will be known.
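The two-step projection (solve for the pressure, then correct the velocity) can be illustrated with a small Python sketch. It is a one-dimensional toy under several assumptions that are not taken from the paper: velocities live on cell faces and pressures at cell centres (a staggered layout), both end faces are walls with equal velocity, and the pressure equation is solved with Gauss-Seidel sweeps. The name `project_1d` is hypothetical.

```python
def project_1d(u_star, dx, dt, rho, sweeps=100):
    """Make staggered face velocities between two walls divergence-free.

    u_star : intermediate face velocities U^0 (length n + 1 for n cells)
    Returns the corrected velocities U and the cell pressures P.
    """
    n = len(u_star) - 1  # number of pressure cells (needs n >= 2)
    # Source term of eq. (9) multiplied by dx^2:
    # (rho / dt) * dU0/dx * dx^2 = (rho / dt) * (U0[i+1] - U0[i]) * dx
    src = [rho / dt * (u_star[i + 1] - u_star[i]) * dx for i in range(n)]
    p = [0.0] * n
    for _ in range(sweeps):  # Gauss-Seidel sweeps (solver choice is an assumption)
        for i in range(n):
            neighbours = []
            if i > 0:
                neighbours.append(p[i - 1])
            if i < n - 1:
                neighbours.append(p[i + 1])
            # Wall faces carry zero pressure gradient, so they drop out.
            p[i] = (sum(neighbours) - src[i]) / len(neighbours)
    u = list(u_star)
    for i in range(1, n):  # eq. (7) applied on interior faces
        u[i] = u_star[i] - dt / rho * (p[i] - p[i - 1]) / dx
    return u, p
```

In one dimension a divergence-free velocity between two stationary walls must be zero everywhere, which makes the result easy to check by hand.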

The energy equation can be written as:

$$\frac{\partial T}{\partial t} = -U_j \frac{\partial T}{\partial x_j} + k_T \frac{\partial^2 T}{\partial x_j^2} + S_T, \tag{10}$$

Fig. 2. The schematic of parallel computing on CUDA. [Diagram: host (CPU) and device (GPU); Grid 1, Grid 2, Grid 3, 4, ...; each grid is divided into blocks, Block(0,0) to Block(2,2); each block is divided into threads, Thread(0,0) to Thread(2,2).]

W. Zuo, Q. Chen / Building and Environment 45 (2010) 747–757 749

where $T$ is temperature, $k_T$ is thermal diffusivity, and $S_T$ is the heat source. The FFD solves equation (10) in a similar way to equation (2), except that no pressure-correction projection for mass conservation is needed.

Very similarly, the FFD also determines the concentrations of species by the following transport equation:

$$\frac{\partial C}{\partial t} = -U_j \frac{\partial C}{\partial x_j} + k_C \frac{\partial^2 C}{\partial x_j^2} + S_C, \tag{11}$$

where $C$ is the species concentration, $k_C$ is the diffusivity, and $S_C$ is the source.

The FFD scheme was originally proposed for computer visualization and computer games [7,14,15]. In our previous work [8,16], the authors have studied the performance of the FFD scheme for indoor environment by computing different indoor airflows. The results showed that the FFD is about 50 times faster than the CFD. The FFD could correctly predict the laminar flow, such as a laminar flow in a lid-driven cavity at Re = 100 [16]. But the FFD has some

Fig. 3. The schematic for implementing the FFD on the GPU. [Flowchart with a CPU column and a GPU column. Box labels: Read Parameters; Allocate CPU Memory; Initialize CPU Variables; Allocate GPU Memory; Initialize GPU Variables; Send Data to GPU; Receive Data from CPU; FFD Solver; Finish? (Yes/No); Send Data to CPU; Receive Data from GPU; Write Data File; Free CPU Memory; Free GPU Memory; End. Arrow types: Command; Command and Data.]

Fig. 4. Allocation of mesh nodes to GPU blocks. [Diagram: a 3 × 3 arrangement of blocks, Block (0,0) to Block (2,2).]


simultaneously hold up to 12,288 threads. Because CUDA does not allow one block to spread across two SMs, the allocation of the blocks is crucial to employ the full capacity of a GPU. For example, if a block has 512 threads, then only one block can be assigned to one SM and the remaining 256 threads in that SM are unused. If a block contains 256 threads, then 3 blocks can share all the 768 threads of an SM so that the SM can be fully used. Theoretically, the 8800 GTX GPU can reach its peak performance when all 12,288 threads are running at the same time. Practically, the peak performance also depends on many other factors, such as the time for reading or writing data with the memory.
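The block-size arithmetic above can be checked with a short Python sketch. It considers only the per-SM thread limit quoted in the text (768 threads); real hardware imposes further limits, such as registers and shared memory, which are ignored here. The function name `sm_thread_usage` is hypothetical.

```python
THREADS_PER_SM = 768   # per-SM thread limit quoted in the text
TOTAL_THREADS = 12288  # device-wide thread limit quoted in the text


def sm_thread_usage(threads_per_block, threads_per_sm=THREADS_PER_SM):
    """Return (blocks per SM, threads used, threads left idle).

    A block cannot be split across SMs, so only whole blocks count.
    """
    blocks = threads_per_sm // threads_per_block
    used = blocks * threads_per_block
    return blocks, used, threads_per_sm - used


for size in (512, 256):
    print(size, sm_thread_usage(size))
```

With these numbers, a 512-thread block leaves a third of each SM idle, while a 256-thread block fills it exactly.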

4. Implementation

The FFD was implemented on the GPU by using CUDA version 1.1 [11]. Fig. 3 shows the program structure. The implementation used the CPU to read, initialize, and write the data. The FFD parallel solver, which is the core of the program, runs on the GPU.

Our program assigned one thread to each mesh node. The implementation further defined a block as a two-dimensional matrix of 16 × 16 = 256 threads. By this means, an SM used three blocks to utilize all of its 768 threads. For simplicity, the current implementation adopted only one grid for all the blocks. As a result, the number of threads in each dimension of the grid was a multiple of 16. However, the number of mesh nodes in each dimension may not always be a multiple of 16. For instance, the mesh (shaded part) in Fig. 4 would not fit into four blocks (0,0; 0,1; 1,0; and 1,1). Thus, it is necessary to use nine blocks for the mesh. Consequently, some threads in the five additional blocks (0,2; 1,2; 2,0; 2,1; and 2,2) could be idle since they did not have mesh nodes. Although this strategy is not optimal, it is the easiest to implement.
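The cost of rounding the mesh up to whole 16 × 16 blocks can be estimated with a short Python sketch. The helper name `grid_for_mesh` and the example mesh sizes are chosen for illustration only; the rounding rule itself is the one described in the paragraph above.

```python
def grid_for_mesh(nx, ny, block=16):
    """Return ((blocks in x, blocks in y), number of idle threads).

    One thread is assigned per mesh node; block counts are rounded up
    so that the whole mesh is covered.
    """
    bx = (nx + block - 1) // block  # ceiling division
    by = (ny + block - 1) // block
    threads = bx * by * block * block
    return (bx, by), threads - nx * ny


for n in (32, 40, 65):
    print(n, grid_for_mesh(n, n))
```

A 32 × 32 mesh fits exactly into 2 × 2 blocks, whereas a 40 × 40 mesh needs 3 × 3 blocks, the situation that Fig. 4 describes.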

The FFD parallel solver on the GPU is the core of our program. The solver consists of different functions for the split equations (3)–(5) in the governing equations. However, the implementations of the various functions are similar in principle. Fig. 5 demonstrates the schematic employed in solving the diffusion equation (4) for the velocity component $u_{i,j}$. Before the iteration starts, our program defines the dimensions of grids and blocks for the parallel computing. In each iteration, the program first solves $u_{i,j}$ at the interior nodes in parallel, then $u_{i,j}$ at the boundary nodes.

In the parallel job, it is important to map the thread indices (threadID.x, threadID.y) in a block onto the coordinates of the mesh nodes (i, j). The "Locate Thread (i, j)" step in Fig. 5 applied the following formulas:

$$i = \text{blockDim.x} \times \text{blockID.x} + \text{threadID.x}, \tag{12}$$

$$j = \text{blockDim.y} \times \text{blockID.y} + \text{threadID.y}, \tag{13}$$

where blockID.x and blockID.y are the indices of the block which contains this thread. The blockDim.x and blockDim.y are the block dimensions in the x and y directions, respectively. Both of them are 16 in our program.
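Equations (12) and (13) can be emulated on the CPU to see which threads receive a mesh node. The sketch below is plain Python, not CUDA; in CUDA itself the corresponding built-in variables are named `blockIdx`, `threadIdx` and `blockDim`. The bounds check that leaves out-of-mesh threads idle is an assumption about how such threads are handled, and the names `node_index` and `active_nodes` are hypothetical.

```python
def node_index(block_dim, block_id, thread_id):
    """Map a thread to mesh node (i, j) using equations (12) and (13)."""
    i = block_dim[0] * block_id[0] + thread_id[0]
    j = block_dim[1] * block_id[1] + thread_id[1]
    return i, j


def active_nodes(nx, ny, block_dim=(16, 16)):
    """List the mesh nodes reached by a grid that covers an nx-by-ny mesh."""
    gx = (nx + block_dim[0] - 1) // block_dim[0]  # blocks in x
    gy = (ny + block_dim[1] - 1) // block_dim[1]  # blocks in y
    nodes = []
    for by in range(gy):
        for bx in range(gx):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    i, j = node_index(block_dim, (bx, by), (tx, ty))
                    if i < nx and j < ny:  # threads outside the mesh stay idle
                        nodes.append((i, j))
    return nodes
```

Every mesh node is visited exactly once, so no two threads write to the same node.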

Fig. 6. Coordinates for the computing meshes. [Diagram: node (i, j) and its neighbors (i−1, j), (i+1, j), (i, j−1), and (i, j+1).]

Fig. 7. Schematic of the flow in a square lid-driven cavity. [Diagram: one wall with U = 1 m/s; U = V = 0 on the other three walls.]

Fig. 8. a. Comparison of the calculated horizontal velocity profile (Re = 100) at x = 0.5 m with Ghia's data [21]. b. Comparison of the calculated vertical velocity profile (Re = 100) at y = 0.5 m with Ghia's data [21]. [Plots: (a) y (m) against U (m/s); (b) V (m/s) against x (m); series: GPU, Ghia.]




For simplicity, the following part describes how the velocity component $u_{i,j}$ at the interior nodes is solved. For a two-dimensional flow, the diffusion term in equation (2) is:

$$\frac{\partial u}{\partial t} = \nu \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right). \tag{14}$$

By applying a first-order implicit time scheme, one can discretize equation (14) into

$$\frac{u^{t+1} - u^{t}}{\Delta t} = \nu \left( \frac{\partial^2 u^{t+1}}{\partial x^2} + \frac{\partial^2 u^{t+1}}{\partial y^2} \right), \tag{15}$$

where $\Delta t$ is the time step, and the superscripts $t$ and $t+1$ represent the previous and current time steps, respectively. Fig. 6 illustrates the coordinates of the mesh. At mesh node (i, j), one can discretize equation (15) in space as:

$$a_{i,j} u^{t+1}_{i,j} + a_{i-1,j} u^{t+1}_{i-1,j} + a_{i+1,j} u^{t+1}_{i+1,j} + a_{i,j-1} u^{t+1}_{i,j-1} + a_{i,j+1} u^{t+1}_{i,j+1} = b_{i,j}, \tag{16}$$

where $a_{i,j}$, $a_{i-1,j}$, $a_{i+1,j}$, $a_{i,j-1}$ and $a_{i,j+1}$ are known coefficients. The $b_{i,j}$ on the right-hand side of equation (16), which contains $u^{t}_{i,j}$, is also known. By this means, one can get a system of equations for all the interior nodes. The equations can be solved in parallel by using the Jacobi iteration.
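A serial Python sketch shows what a Jacobi iteration for equation (16) looks like. The coefficients are an assumption, since the text only says they are known: with central differences on a uniform mesh of spacing `h`, equation (15) gives a centre coefficient of 1 + 4r and neighbour coefficients of −r, where r = νΔt/h², and the right-hand side is the previous-step value. Boundary values are simply held fixed, and the name `jacobi_diffusion` is hypothetical.

```python
def jacobi_diffusion(u_old, nu, dt, h, iterations=60):
    """Approximate u at t+1 from u at t for the implicit diffusion step.

    Assumed coefficients: a_center = 1 + 4r, a_neighbour = -r,
    b = u_old, with r = nu * dt / h**2. Boundary nodes are not updated.
    """
    ny, nx = len(u_old), len(u_old[0])
    r = nu * dt / (h * h)
    u = [row[:] for row in u_old]
    for _ in range(iterations):
        new = [row[:] for row in u]
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                # Only values from the previous iterate are read, so every
                # node could be updated by its own thread at the same time.
                nb = u[j][i - 1] + u[j][i + 1] + u[j - 1][i] + u[j + 1][i]
                new[j][i] = (u_old[j][i] + r * nb) / (1.0 + 4.0 * r)
        u = new
    return u
```

Reading only from the previous iterate and writing to a separate array is what distinguishes Jacobi from Gauss-Seidel, and it is the property that lets one GPU thread per node work without conflicts.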

In general, our implementation of the FFD parallel solver on the GPU used the same principles as other parallel computing on a multi-processor supercomputer. For more information on parallel computing, one can refer to books [18–20].

Fig. 9. a. Comparison of the calculated horizontal velocity profiles (Re = 10,000) at x = 0.5 m with Ghia's data [21]. b. Comparison of the calculated vertical velocity profile (Re = 10,000) at y = 0.5 m with Ghia's data [21]. [Plots: (a) y (m) against U (m/s); (b) V (m/s) against x (m); series: FFD on GPU, Ghia.]

Fig. 10. a. Calculated streamlines for lid-driven cavity flow at Re = 10,000. b. Ghia's data [21] for streamlines for lid-driven cavity flow at Re = 10,000 (reprint with permission from Elsevier).

Fig. 11. Schematic of the fully developed flow in a plane channel.

5. Results and discussion

To evaluate the FFD on the GPU for indoor airflow simulation, this study compared the results of the FFD on the GPU with the reference data. In addition, it was interesting to see the speed of the simulations.

5.1. Evaluation of the results

The evaluation was performed by using the FFD on the GPU to calculate four airflows relevant to the indoor environment. The four flows were the flow in a lid-driven cavity, the fully developed flow in a plane channel, the forced convective flow in an empty room, and the natural convective flow in a tall cavity. The simulation results are compared with the data from the literature.

5.1.1. Flow in a square cavity driven by a lid

Air recirculated in a room is like the flow in a lid-driven cavity (Fig. 7). This flow is also a classical case for numerical validation [21]. This investigation studied both laminar and turbulent flows. Based on the lid velocity of U = 1 m/s, cavity length of L = 1 m, and kinematic viscosity of the fluid, the Reynolds number of the laminar flow was 100 and that of the turbulent one was 10,000. A mesh with 65 × 65 grid points was enough for a laminar flow with Re = 100. Since the FFD model had no turbulence model, it required a dense mesh for the highly turbulent flow if an accurate result was desired. Thus, this study applied a fine mesh with 513 × 513 grid points for the flow at Re = 10,000. The reference data was the high quality CFD results obtained by Ghia et al. [21].
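The text gives the lid velocity, the cavity length, and the two Reynolds numbers, but not the viscosities themselves. Under the standard definition Re = UL/ν, the implied kinematic viscosities follow directly; the Python sketch below only performs that arithmetic, and the function name `kinematic_viscosity` is hypothetical.

```python
def kinematic_viscosity(u, length, reynolds):
    """Kinematic viscosity (m^2/s) implied by Re = U * L / nu."""
    return u * length / reynolds


for re in (100, 10000):
    print(re, kinematic_viscosity(1.0, 1.0, re))
```

With U = 1 m/s and L = 1 m, the two cases therefore differ only in the viscosity, by a factor of 100.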

Fig. 8 compares the computed velocity profiles of the laminar flow (Re = 100) at the vertical (Fig. 8a) and horizontal (Fig. 8b) mid-sections with the reference data. The predictions by the FFD on the GPU are the same as Ghia's data for laminar flow. These results show that the FFD model works well for laminar flow.

The flow at Re = 10,000 is highly turbulent. Although the current FFD model has no turbulence treatment, it could still provide very accurate results by using a dense mesh (513 × 513). As shown in Fig. 9, the FFD on the GPU was able to accurately calculate the velocities at both the vertical and horizontal mid-sections of the cavity. The predicted velocity profiles agree with the reference data. Fig. 10 compares the streamlines calculated by the FFD with the reference ones [21]. The predicted profiles of the vortices (Fig. 10a) are similar to those of the reference (Fig. 10b). The FFD on the GPU successfully computed not only the primary recirculation in the center of the cavity, but also the secondary vortices in the upper-left, lower-left, and lower-right corners. There was one anti-clockwise vortex in the upper-left corner, and one anti-clockwise vortex plus one smaller clockwise vortex in each of the lower-left and lower-right corners. Although this is a simple case, it demonstrates that the GPU can be used for numerical computing just as the CPU can.

5.1.2. Flow in a fully developed plane channel

The flow in a long corridor can be simplified as a fully developed flow in a plane channel (Fig. 11). The Reynolds number of the flow studied was 2800, based on the mean bulk velocity Ub and the half channel height, H. A mesh with 65 × 33 grid points was adopted by the FFD simulations. The Direct Numerical Simulation (DNS) data from Mansour et al. [22] was selected as a reference. Fig. 12 compares the velocity profiles predicted by the FFD on both the CPU and the GPU with the DNS data. Unlike the turbulent profile given by the DNS data, the FFD on the GPU gave a more laminar-like profile. As discussed by the authors [8], this laminar profile was caused by a lack of turbulence treatment in the current FFD model. Nevertheless, the GPU worked properly and the result of the FFD on the GPU was the same as that on the CPU for this case.

5.1.3. Flow in an empty room with forced convection

A forced convection flow in an empty room represents flows in mechanically ventilated rooms (Fig. 13). The study was based on the experiment by Nielsen [23]. His experimental data showed that the flow in the room can be simplified into two dimensions. The height of the tested room, H, is 3 m and the width is 3H. The inlet was in the upper-left corner with a height of 0.56H. The outlet height was 0.16H and the outlet was located in the lower-right corner. The Reynolds number was 5000, based on the inlet height and inlet velocity, which can lead to turbulent flow in a room. This study employed a mesh of 37 × 37 grid points.

Fig. 14 compares the predicted horizontal velocity profiles at the centers of the room (x = H and 2H) and at the near wall regions (y = 0.028H and 0.972H) with the experimental data. As expected,

Fig. 13. Schematic of a forced convective flow in an empty room. [Diagram: room of height H and length L = 3H with axes x and y; inlet height h_in; outlet height h_out; U_in = 0.455 m/s.]

Fig. 12. Comparison of the mean velocity profile in a fully developed channel flow predicted by the FFD on a CPU and a GPU with the DNS data [22]. [Plot: U/U_b against y/H; series: GPU, CPU, DNS.]


Proceedings: Building Simulation 2007

Figure 7 The velocity field predicted by different numerical methods: (a) FFD; (b) standard k-ε model (Chen 1995); (c) LES (Su et al. 2001).

The N strongly depends on the number of grids and the time step size. A coarse grid and large time steps can accelerate the simulation but degrade the accuracy accordingly. Therefore, one has to find a trade-off between computational performance and accuracy. For the three cases, the FFD simulations were faster than real time on a Dell Inspiron laptop with an Intel Core 2 CPU T200 at 2.00 GHz. Table 1 lists the performance of the FFD simulations. Although this CPU is dual core, the FFD simulations used only one processor.

Table 1 Performance of FFD

CASE                GRIDS        Δt (s)   N
Lid-driven cavity   20 × 20      0.1      44.5
Plane channel       32 × 8       0.05     30.3
Ventilated room     300 × 125    0.5      2.4

CONCLUSION

The Fast Fluid Dynamics (FFD) method, based on the semi-Lagrangian method, was validated for three different flows: flow in a lid-driven cavity, flow in a plane channel, and flow in a ventilated room. The accuracy of the FFD method has been evaluated by comparing the predicted results with the experimental and reference CFD data. The FFD method can predict the flow with acceptable accuracy at a speed faster than real time.

ACKNOWLEDGMENT

This project was funded by the U.S. Federal Aviation Administration (FAA) Office of Aerospace Medicine through the Air Transportation Center of Excellence for Airliner Cabin Environment Research under Cooperative Agreement 04-C-ACE-PU. Although the FAA has sponsored this project, it neither endorses nor rejects the findings of this research. The presentation of this information is in the interest of invoking technical community comment on the results and conclusions of research.

REFERENCES

Bozeman JD. and Dalton C. 1973. "Numerical Study of Viscous Flow in a Cavity," Journal of Computational Physics, 12(3): 348-363.

Chen Q. 1995. "Comparison of Different K-Epsilon Models for Indoor Air-Flow Computations," Numerical Heat Transfer Part B-Fundamentals, 28(3): 353-369.

Erturk E, Corke TC, and Gokcol C. 2005. "Numerical solutions of 2-D steady incompressible driven cavity flow at high Reynolds numbers," International Journal for Numerical Methods in Fluids, 48(7): 747-774.

Ghia U, Ghia KN, and Shin CT. 1982. "High-Re Solutions for Incompressible Flow Using the Navier-Stokes Equations and a Multigrid Method," Journal of Computational Physics, 48(3): 387-411.

Kim J, Moin P, and Moser R. 1987. "Turbulence Statistics in Fully-Developed Channel Flow at Low Reynolds-Number," Journal of Fluid Mechanics, 177: 133-166.

Restivo A. 1979. "Turbulent flow in ventilated room," Ph.D. Thesis, University of London (U.K.).

Robert A, Turnbull C, and Henderso J. 1972. "Implicit Time Integration Scheme for Baroclinic Models of Atmosphere," Monthly Weather Review, 100(5): 329-335.

Su M, Chen Q, and Chiang C. 2001. "Comparison of different subgrid-scale models of large eddy simulation for indoor airflow modeling," Journal of Fluids Engineering-Transactions of the ASME, 123(3): 628-639.

Wang L. 2007. "Coupling of Multizone and CFD Programs for Building Airflow and Contaminant Transport Simulations," Ph.D. Thesis, Purdue University.

the FFD on the GPU could capture the major characteristics of the flow velocities (Fig. 14a and 14b). But the differences between the prediction and the experimental data are large in the near-wall region (Fig. 14c and 14d) since we only applied a simple no-slip wall boundary condition. An advanced wall function may improve the results, but it would make the code more complex and require more computing time.

5.1.4. Flow in a natural convective tall cavity

The flows in the previous three cases were isothermal. The FFD on the GPU was further validated by using a non-isothermal flow, such as a natural convection flow inside a dual window. This case was based on the experiment by Betts and Bokhari [24]. They measured the natural convection flow in a tall cavity 0.076 m wide and 2.18 m high (Fig. 15). The cavity was deep enough so that the flow pattern was two-dimensional. The left wall was cooled at 15.1 °C and the right wall heated at 34.7 °C. The top and bottom walls were insulated. The corresponding Rayleigh number was 0.86 × 10^6. A coarse mesh of 11 × 21 was applied. Fig. 16 compares the predicted velocity and temperature with the experimental data at three different lines across the cavity. The results show that the FFD on the GPU gave reasonable velocity and temperature profiles.
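The order of magnitude of the quoted Rayleigh number can be checked from the wall temperatures and the cavity width. The Python sketch below uses the usual definition Ra = gβΔT L³/(να) with L taken as the cavity width. The air properties (ν, α) and the ideal-gas expansion coefficient β = 1/T_mean are textbook approximations assumed for this sketch; they are not given in the text, so the result should only be expected to land near the reported 0.86 × 10^6, not to reproduce it exactly.

```python
G = 9.81          # gravitational acceleration, m/s^2
NU_AIR = 1.56e-5  # kinematic viscosity of air, m^2/s (assumed value)
ALPHA_AIR = 2.2e-5  # thermal diffusivity of air, m^2/s (assumed value)


def rayleigh_number(t_cold_c, t_hot_c, width):
    """Ra = g * beta * dT * L^3 / (nu * alpha), temperatures in deg C."""
    t_mean_k = 0.5 * (t_cold_c + t_hot_c) + 273.15
    beta = 1.0 / t_mean_k  # ideal-gas thermal expansion coefficient
    d_t = t_hot_c - t_cold_c
    return G * beta * d_t * width ** 3 / (NU_AIR * ALPHA_AIR)


print(f"Ra ~ {rayleigh_number(15.1, 34.7, 0.076):.2e}")
```

Because Ra scales with the cube of the width, small changes in the chosen length scale or property values shift the result noticeably.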

Again, the results obtained by the FFD on the GPU differ from the experimental data, but they are the same as those of the FFD on the CPU. The results lead to a similar conclusion as in the previous cases.

The above four cases show that the FFD code on the GPU produced accurate results for lid-driven cavity flow and reasonable results for other airflows. Due to the limitation of the FFD model, predictions by the FFD on the GPU may differ from the reference data.

5.2. Comparison of the simulation speed

To compare the FFD simulation speed on the GPU with that on the CPU, this study measured their computing time for the lid-driven cavity flow. In addition, this study also measured the computing time of the CFD on a CPU. The commercial CFD software FLUENT was used in the measurement. The simulations were carried out on an HP workstation with an Intel Xeon CPU and an NVIDIA 8800 GTX GPU. The timings were for 100 time steps with different numbers of mesh nodes.

Fig. 17 illustrates that for both CFD and FFD, the CPU computing time increased linearly with the mesh size. The CFD on the CPU was

Fig. 14. a. Comparison of the horizontal velocity at x = H in forced convection predicted by the FFD on a GPU with experimental data [23]. b. Comparison of the horizontal velocity at x = 2H in forced convection predicted by the FFD on a GPU with experimental data [23]. c. Comparison of the horizontal velocity at y = 0.028H in forced convection predicted by the FFD on a GPU with experimental data [23]. d. Comparison of the horizontal velocity at y = 0.972H in forced convection predicted by the FFD on a GPU with experimental data [23]. [Plots: (a), (b) y/H against U/U_in; (c), (d) U/U_in against x/H; series: GPU, Experiment.]


5.3. Discussion

This study implemented the FFD solver for flow simulation on the GPU. Since the FFD solves the same governing equations as the CFD, it is also possible to implement the CFD solver on the GPU by using a similar strategy. One can also expect that the speed of CFD simulations on the GPU should be faster than that on the CPU. For the CFD codes written in C language, the implementation will be relatively easy since only the parallel computing part needs to be rewritten in CUDA.

Current GPU computing speed can be further accelerated by optimizing the code implementation. The dimensions of GPU blocks can be made flexible to adapt to the mesh. Meanwhile, many classical optimization techniques for parallel computing are also good for GPU computing. For example, reading or writing data from GPU memory is time consuming, so the processors are often idle while waiting for data transmission. One approach is to reuse the data already on the GPU by calculating several neighboring mesh nodes with one thread.

In addition, the computing time can be further reduced by using multiple GPUs. For example, an NVIDIA Tesla 4-GPU computer has 960 processors and 16 GB system memory [25]. Its peak performance can be as high as 4 teraFLOPS, which is about 10 times faster than the GPU used in this study. Thus, the computing time of a problem with large meshes can be greatly reduced by using multiple GPUs.

6. Conclusions

This paper introduced an approach to conduct fast and informative indoor airflow simulation by using the FFD on the GPU. An FFD code has been implemented in parallel on a GPU for indoor airflow simulation. By applying the code to flow in a lid-driven cavity, a channel flow, a forced convective flow, and a natural convective flow, this investigation showed that the FFD on a GPU could predict indoor airflow motion and air temperature. The prediction was the same as the data in the literature for lid-driven cavity flow. The FFD on the GPU can also capture major flow characteristics for other cases, including fully developed channel flow, forced convective flow and natural convective flow. But some differences exist due to the limitations of the FFD model, such as the lack of a turbulence model and the simple no-slip wall treatment.

In addition, a flow simulation with the FFD on the GPU was 30 times faster than that on the CPU when the mesh size was a multiple of 256. If the mesh size was not an exact multiple of 256, the simulation was still 10 times faster than that on the CPU. As a whole, the FFD on a GPU can be 500–1500 times faster than the CFD on a CPU.

Acknowledgements

This study was funded by the US Federal Aviation Administration (FAA) Office of Aerospace Medicine through the National Air Transportation Center of Excellence for Research in the Intermodal Transport Environment under Cooperative Agreement 07-CRITE-PU and co-funded by the Computing Research Institute at Purdue University. Although the FAA has sponsored this project, it neither endorses nor rejects the findings of this research. The presentation of this information is in the interest of invoking technical community comment on the results and conclusions of the research.

References

[1] United States Fire Administration. Fire statistics, http://www.usfa.dhs.gov/statistics/national/index.shtm; 2008.

[2] Chen Q. Design of natural ventilation with CFD. In: Glicksman LR, Lin J, editors. Sustainable urban housing in China. Springer; 2006. p. 116–23 [chapter 7].

[3] Nielsen PV. Computational fluid dynamics and room air movement. Indoor Air 2004;14:134–43.

[4] Lin C, Horstman R, Ahlers M, Sedgwick L, Dunn K, Wirogo S. Numerical simulation of airflow and airborne pathogen transport in aircraft cabins – part 1: numerical simulation of the flow field. ASHRAE Transactions 2005:111.

[5] Megri AC, Haghighat F. Zonal modeling for simulating indoor environment of buildings: review, recent developments, and applications. HVAC&R Research 2007;13(6):887–905.

[6] Chen Q. Ventilation performance prediction for buildings: a method overview and recent applications. Building and Environment 2009;44(4):848–58.

[7] Stam J. Stable fluids. In: Proceedings of 26th international conference on computer graphics and interactive techniques, SIGGRAPH'99, Los Angeles; 1999.

[8] Zuo W, Chen Q. Real-time or faster-than-real-time simulation of airflow in buildings. Indoor Air 2009;19(1):33–44.

[9] Mazumdar S, Chen Q. Influence of cabin conditions on placement and response of contaminant detection sensors in a commercial aircraft. Journal of Environmental Monitoring 2008;10(1):71–81.

[10] Hasama T, Kato S, Ooka R. Analysis of wind-induced inflow and outflow through a single opening using LES & DES. Journal of Wind Engineering and Industrial Aerodynamics 2008;96(10–11):1678–91.

[11] Nvidia. Nvidia CUDA compute unified device architecture – programming guide (version 1.1). Santa Clara, California: NVIDIA Corporation; 2007.

[12] Courant R, Isaacson E, Rees M. On the solution of nonlinear hyperbolic differential equations by finite differences. Communication on Pure and Applied Mathematics 1952;5:243–55.

[13] Chorin AJ. A numerical method for solving incompressible viscous flow problems. Journal of Computational Physics 1967;2(1):12–26.

[14] Harris MJ. Real-time cloud simulation and rendering. Ph.D. thesis, University of North Carolina at Chapel Hill; 2003.

[15] Song O-Y, Shin H, Ko H-S. Stable but nondissipative water. ACM Transactions on Graphics 2005;24(1):81–97.

[16] Zuo W, Chen Q. Validation of fast fluid dynamics for room airflow. In: Proceedings of the 10th international IBPSA conference, Building Simulation 2007, Beijing, China; 2007.

[17] Rixner S. Stream processor architecture. Boston & London: Kluwer Academic Publishers; 2002.

[18] Roosta SH. Parallel processing and parallel algorithms: theory and computation. New York: Springer; 1999.

[19] Bertsekas DP, Tsitsiklis JN. Parallel and distributed computation: numerical methods. Belmont, Massachusetts: Athena Scientific; 1989.

[20] Lewis TG, El-Rewini H, Kim I-K. Introduction to parallel computing. Englewood Cliffs, New Jersey: Prentice Hall; 1992.

[21] Ghia U, Ghia KN, Shin CT. High-Re solutions for incompressible flow using the Navier–Stokes equations and a multigrid method. Journal of Computational Physics 1982;48(3):387–411.

[22] Mansour NN, Kim J, Moin P. Reynolds-stress and dissipation-rate budgets in a turbulent channel flow. Journal of Fluid Mechanics 1988;194:15–44.

[23] Nielsen PV. Specification of a two-dimensional test case. Aalborg, Denmark: Aalborg University; 1990.

[24] Betts PL, Bokhari IH. Experiments on turbulent natural convection in an enclosed tall cavity. International Journal of Heat and Fluid Flow 2000;21(6):675–83.

[25] NVIDIA, http://www.nvidia.com/object/tesla_computing_solutions.html; 2009.

Fig. 17. Comparison of the computing time used by the FFD on a GPU, the FFD on a CPU, and the CFD on a CPU. [Plot: computing time against number of grids, both axes in powers of ten; series: FFD on GPU, FFD on CPU, CFD on CPU.]


Not Perfect

OpenFOAM
The Open Source CFD Toolbox

User Guide

Version 1.6
24th July 2009

U-70 Applications and libraries

The problems that we wish to solve in continuum mechanics are not presented in terms of intrinsic entities, or types, known to a computer, e.g. bits, bytes, integers. They are usually presented first in verbal language, then as partial differential equations in 3 dimensions of space and time. The equations contain the following concepts: scalars, vectors, tensors, and fields thereof; tensor algebra; tensor calculus; dimensional units. The solution to these equations involves discretisation procedures, matrices, solvers, and solution algorithms. The topics of tensor mathematics and numerics are the subjects of chapter 1 and chapter 2 of the Programmer's Guide.

3.1.2 Object-orientation and C++

Programming languages that are object-oriented, such as C++, provide the mechanism (classes) to declare types and associated operations that are part of the verbal and mathematical languages used in science and engineering. Our velocity field introduced earlier can be represented in programming code by the symbol U and "the field of velocity magnitude" can be mag(U). The velocity is a vector field for which there should exist, in an object-oriented code, a vectorField class. The velocity field U would then be an instance, or object, of the vectorField class; hence the term object-oriented.

The clarity of having objects in programming that represent physical objects and abstract entities should not be underestimated. The class structure concentrates code development to contained regions of the code, i.e. the classes themselves, thereby making the code easier to manage. New classes can be derived or inherit properties from other classes, e.g. the vectorField can be derived from a vector class and a Field class. C++ provides the mechanism of template classes such that the template class Field<Type> can represent a field of any <Type>, e.g. scalar, vector, tensor. The general features of the template class are passed on to any class created from the template. Templating and inheritance reduce duplication of code and create class hierarchies that impose an overall structure on the code.

3.1.3 Equation representation

A central theme of the OpenFOAM design is that the solver applications, written using the OpenFOAM classes, have a syntax that closely resembles the partial differential equations being solved. For example the equation

$$\frac{\partial \rho \mathbf{U}}{\partial t} + \nabla \cdot \phi \mathbf{U} - \nabla \cdot \mu \nabla \mathbf{U} = -\nabla p$$

is represented by the code

solve
(
    fvm::ddt(rho, U)
  + fvm::div(phi, U)
  - fvm::laplacian(mu, U)
 ==
  - fvc::grad(p)
);

This and other requirements demand that the principal programming language of OpenFOAM has object-oriented features such as inheritance, template classes, virtual functions

OpenFOAM-1.6

Chapter 1

Introduction

This guide accompanies the release of version 1.6 of the Open Source Field Operation and Manipulation (OpenFOAM) C++ libraries. It provides a description of the basic operation of OpenFOAM, first through a set of tutorial exercises in chapter 2 and later by a more detailed description of the individual components that make up OpenFOAM.

OpenFOAM is first and foremost a C++ library, used primarily to create executables, known as applications. The applications fall into two categories: solvers, that are each designed to solve a specific problem in continuum mechanics; and utilities, that are designed to perform tasks that involve data manipulation. The OpenFOAM distribution contains numerous solvers and utilities covering a wide range of problems, as described in chapter 3.

One of the strengths of OpenFOAM is that new solvers and utilities can be created by its users with some pre-requisite knowledge of the underlying method, physics and programming techniques involved.

OpenFOAM is supplied with pre- and post-processing environments. The interfaces to the pre- and post-processing are themselves OpenFOAM utilities, thereby ensuring consistent data handling across all environments. The overall structure of OpenFOAM is shown in Figure 1.1. The pre-processing and running of OpenFOAM cases is described in chapter 4. In chapter 5, we cover both the generation of meshes using the mesh generator supplied with OpenFOAM and conversion of mesh data generated by third-party products. Post-processing is described in chapter 6.

[Figure: the Open Source Field Operation and Manipulation (OpenFOAM) C++ Library underlies three stages: pre-processing (utilities, meshing tools), solving (user applications, standard applications) and post-processing (ParaView, others e.g. EnSight).]

Figure 1.1: Overview of OpenFOAM structure.

U-118 OpenFOAM cases

The syntax for each entry within solvers uses a keyword that is the word relating to the variable being solved in the particular equation. For example, icoFoam solves equations for velocity U and pressure p, hence the entries for U and p. The keyword is followed by a dictionary containing the type of solver and the parameters that the solver uses. The solver is selected through the solver keyword from the choice in OpenFOAM, listed in Table 4.12. The parameters, including tolerance, relTol, preconditioner, etc. are described in following sections.

Solver                                        Keyword
Preconditioned (bi-)conjugate gradient        PCG/PBiCG†
Solver using a smoother                       smoothSolver
Generalised geometric-algebraic multi-grid    GAMG

†PCG for symmetric matrices, PBiCG for asymmetric

Table 4.12: Linear solvers.
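As an illustration of the entry syntax described above, a solvers dictionary for a case that solves for p and U might look like the following fragment. The tolerance values and the choice of the DIC and DILU preconditioners are assumptions made for the example, not recommended settings.

```
solvers
{
    p
    {
        solver          PCG;      // symmetric matrix
        preconditioner  DIC;
        tolerance       1e-06;
        relTol          0;
    }

    U
    {
        solver          PBiCG;    // asymmetric matrix
        preconditioner  DILU;
        tolerance       1e-05;
        relTol          0;
    }
}
```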

The solvers distinguish between symmetric matrices and asymmetric matrices. The symmetry of the matrix depends on the structure of the equation being solved and, while the user may be able to determine this, it is not essential since OpenFOAM will produce an error message to advise the user if an inappropriate solver has been selected, e.g.

--> FOAM FATAL IO ERROR : Unknown asymmetric matrix solver PCG
Valid asymmetric matrix solvers are :
3
(
PBiCG
smoothSolver
GAMG
)

4.5.1.1 Solution tolerances

The sparse matrix solvers are iterative, i.e. they are based on reducing the equation residual over a succession of solutions. The residual is ostensibly a measure of the error in the solution so that the smaller it is, the more accurate the solution. More precisely, the residual is evaluated by substituting the current solution into the equation and taking the magnitude of the difference between the left and right hand sides; it is also normalised to make it independent of the scale of the problem being analysed.
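The residual evaluation can be sketched for a small dense system A x = b. The dense Matrix type, the function names and, in particular, the normalisation by the sum of |b| are assumptions made for illustration; OpenFOAM computes its own normalisation factor differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vector;
typedef std::vector<Vector> Matrix;   // dense, row-major; illustration only

// Substitute the current solution x into the equation and sum the
// magnitude of the difference between right and left hand sides.
double residualMagnitude(const Matrix& A, const Vector& x, const Vector& b)
{
    assert(A.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < A.size(); ++i)
    {
        assert(A[i].size() == x.size());
        double Ax = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j)
        {
            Ax += A[i][j]*x[j];
        }
        sum += std::fabs(b[i] - Ax);
    }
    return sum;
}

// Assumed normalisation: divide by the sum of |b|, with a small guard
// against division by zero, so the result does not depend on the scale
// of the problem.
double normalisedResidual(const Matrix& A, const Vector& x, const Vector& b)
{
    double norm = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i)
    {
        norm += std::fabs(b[i]);
    }
    return residualMagnitude(A, x, b)/(norm + 1e-20);
}
```

Multiplying both A and b by a constant leaves the normalised residual unchanged, which is the scale independence the normalisation is meant to provide.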

Before solving an equation for a particular field, the initial residual is evaluated based on the current values of the field. After each solver iteration the residual is re-evaluated. The solver stops if either of the following conditions is reached:

• the residual falls below the solver tolerance, tolerance;

• the ratio of current to initial residuals falls below the solver relative tolerance, relTol.

The solver tolerance should represent the level at which the residual is small enough that the solution can be deemed sufficiently accurate. The solver relative tolerance limits the relative improvement from initial to final solution. It is quite common to set the solver relative tolerance to 0 to force the solution to converge to the solver tolerance. The tolerances, tolerance and relTol, must be specified in the dictionaries for all solvers.
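The two stopping conditions can be written as a short check. The SolverControls struct and the converged function are hypothetical names introduced for this sketch; only the logic (absolute tolerance, or ratio of current to initial residual) is taken from the description above.

```cpp
#include <cassert>

// Hypothetical container for the two user-specified tolerances.
struct SolverControls
{
    double tolerance;   // absolute level the residual must fall below
    double relTol;      // limit on the ratio of current to initial residual
};

// Returns true if either stopping condition is met.
bool converged
(
    double initialResidual,
    double currentResidual,
    const SolverControls& controls
)
{
    assert(initialResidual >= 0.0 && currentResidual >= 0.0);

    const bool absolute = currentResidual < controls.tolerance;
    const bool relative =
        currentResidual < controls.relTol*initialResidual;

    return absolute || relative;
}
```

Setting relTol to 0 makes the relative test impossible to satisfy for a non-negative residual, so the solver can only stop once the residual falls below tolerance.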


/*---------------------------------------------------------------------------*\
  =========                 |
  \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox
   \\    /   O peration     |
    \\  /    A nd           | Copyright (C) 1991-2008 OpenCFD Ltd.
     \\/     M anipulation  |
-------------------------------------------------------------------------------
License
    This file is part of OpenFOAM.

OpenFOAM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

OpenFOAM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with OpenFOAM; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

\*---------------------------------------------------------------------------*/

#include "PCG.H"

// * * * * * * * * * * * * * * Static Data Members * * * * * * * * * * * * * //

namespace Foam
{
    defineTypeNameAndDebug(PCG, 0);

    lduMatrix::solver::addsymMatrixConstructorToTable<PCG>
        addPCGSymMatrixConstructorToTable_;
}

// * * * * * * * * * * * * * * * * Constructors * * * * * * * * * * * * * * //

Foam::PCG::PCG
(
    const word& fieldName,
    const lduMatrix& matrix,
    const FieldField<Field, scalar>& interfaceBouCoeffs,
    const FieldField<Field, scalar>& interfaceIntCoeffs,
    const lduInterfaceFieldPtrsList& interfaces,
    Istream& solverData
)
:
    lduMatrix::solver
    (
        fieldName,
        matrix,
        interfaceBouCoeffs,
        interfaceIntCoeffs,
        interfaces,
        solverData
    )
{}

// * * * * * * * * * * * * * * * Member Functions * * * * * * * * * * * * * //

Foam::lduMatrix::solverPerformance Foam::PCG::solve
(
    scalarField& psi,
    const scalarField& source,
    const direction cmpt
) const
{
    word preconditionerName(controlDict_.lookup("preconditioner"));

    // --- Setup class containing solver performance data
    lduMatrix::solverPerformance solverPerf
    (
        preconditionerName + typeName,
        fieldName_
    );

    register label nCells = psi.size();

    scalar* __restrict__ psiPtr = psi.begin();

    scalarField pA(nCells);
    scalar* __restrict__ pAPtr = pA.begin();

    scalarField wA(nCells);
    scalar* __restrict__ wAPtr = wA.begin();

    scalar wArA = matrix_.great_;
    scalar wArAold = wArA;

    // --- Calculate A.psi
    matrix_.Amul(wA, psi, interfaceBouCoeffs_, interfaces_, cmpt);

    // --- Calculate initial residual field
    scalarField rA(source - wA);
    scalar* __restrict__ rAPtr = rA.begin();

    // --- Calculate normalisation factor
    scalar normFactor = this->normFactor(psi, source, wA, pA);

    if (lduMatrix::debug >= 2)
    {
        Info<< " Normalisation factor = " << normFactor << endl;
    }

    // --- Calculate normalised residual norm
    solverPerf.initialResidual() = gSumMag(rA)/normFactor;
    solverPerf.finalResidual() = solverPerf.initialResidual();

    // --- Check convergence, solve if not converged
    if (!solverPerf.checkConvergence(tolerance_, relTol_))
    {
        // --- Select and construct the preconditioner
        autoPtr<lduMatrix::preconditioner> preconPtr =
        lduMatrix::preconditioner::New
        (
            *this,
            controlDict_.lookup("preconditioner")
        );

        // --- Solver iteration
        do
        {
            // --- Store previous wArA
            wArAold = wArA;

            // --- Precondition residual
            preconPtr->precondition(wA, rA, cmpt);

            // --- Update search directions:
            wArA = gSumProd(wA, rA);

            if (solverPerf.nIterations() == 0)
            {
                #ifdef ICC_IA64_PREFETCH
                #pragma ivdep
                #endif

                for (register label cell=0; cell<nCells; cell++)
                {
                    #ifdef ICC_IA64_PREFETCH
                    __builtin_prefetch (&pAPtr[cell+96],0,1);
                    __builtin_prefetch (&wAPtr[cell+96],0,1);
                    #endif

                    pAPtr[cell] = wAPtr[cell];
                }
            }
            else
            {
                scalar beta = wArA/wArAold;

                #ifdef ICC_IA64_PREFETCH
                #pragma ivdep
                #endif

                for (register label cell=0; cell<nCells; cell++)
                {
                    #ifdef ICC_IA64_PREFETCH
                    __builtin_prefetch (&pAPtr[cell+96],0,1);
                    __builtin_prefetch (&wAPtr[cell+96],0,1);
                    #endif

                    pAPtr[cell] = wAPtr[cell] + beta*pAPtr[cell];
                }
            }

            // --- Update preconditioned residual
            matrix_.Amul(wA, pA, interfaceBouCoeffs_, interfaces_, cmpt);

            scalar wApA = gSumProd(wA, pA);

            // --- Test for singularity
            if (solverPerf.checkSingularity(mag(wApA)/normFactor)) break;

            // --- Update solution and residual:

            scalar alpha = wArA/wApA;

            #ifdef ICC_IA64_PREFETCH
            #pragma ivdep
            #endif

            for (register label cell=0; cell<nCells; cell++)
            {
                #ifdef ICC_IA64_PREFETCH
                __builtin_prefetch (&pAPtr[cell+96],0,1);
                __builtin_prefetch (&wAPtr[cell+96],0,1);
                __builtin_prefetch (&psiPtr[cell+96],0,1);
                __builtin_prefetch (&rAPtr[cell+96],0,1);
                #endif

                psiPtr[cell] += alpha*pAPtr[cell];
                rAPtr[cell] -= alpha*wAPtr[cell];
            }

            solverPerf.finalResidual() = gSumMag(rA)/normFactor;

        } while
        (
            solverPerf.nIterations()++ < maxIter_
         && !(solverPerf.checkConvergence(tolerance_, relTol_))
        );
    }

    return solverPerf;
}

// ************************ vim: set sw=4 sts=4 et: ************************ //

Preconditioned Conjugate Gradient Solver
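The same iteration can be tried outside OpenFOAM with a self-contained sketch. It reuses the variable names of the listing (psi, source, rA, wA, pA, wArA, wApA) but makes several assumptions that are not in the original: the matrix is stored densely, the preconditioner is a simple Jacobi (diagonal) one, the residual is not normalised, and solvePCG and its helper functions are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vector;
typedef std::vector<Vector> Matrix;   // dense storage; illustration only

static Vector multiply(const Matrix& A, const Vector& x)
{
    Vector y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
    {
        for (std::size_t j = 0; j < x.size(); ++j)
        {
            y[i] += A[i][j]*x[j];
        }
    }
    return y;
}

static double dot(const Vector& a, const Vector& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
    return s;
}

static double sumMag(const Vector& a)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += std::fabs(a[i]);
    return s;
}

// Solves A psi = source for a symmetric positive-definite A.
// Returns the number of iterations performed.
int solvePCG
(
    const Matrix& A,
    Vector& psi,
    const Vector& source,
    double tolerance,
    int maxIter
)
{
    const std::size_t n = psi.size();
    assert(A.size() == n && source.size() == n);

    // Initial residual rA = source - A.psi
    Vector wA = multiply(A, psi);
    Vector rA(n), pA(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) rA[i] = source[i] - wA[i];

    double wArA = 0.0;
    int nIterations = 0;

    while (nIterations < maxIter && sumMag(rA) >= tolerance)
    {
        const double wArAold = wArA;

        // Precondition residual (assumed Jacobi: divide by the diagonal)
        for (std::size_t i = 0; i < n; ++i) wA[i] = rA[i]/A[i][i];

        // Update search direction
        wArA = dot(wA, rA);
        const double beta = (nIterations == 0) ? 0.0 : wArA/wArAold;
        for (std::size_t i = 0; i < n; ++i) pA[i] = wA[i] + beta*pA[i];

        // Step length along the search direction
        wA = multiply(A, pA);
        const double wApA = dot(wA, pA);
        if (std::fabs(wApA) < 1e-300) break;   // singularity guard
        const double alpha = wArA/wApA;

        // Update solution and residual
        for (std::size_t i = 0; i < n; ++i)
        {
            psi[i] += alpha*pA[i];
            rA[i] -= alpha*wA[i];
        }

        ++nIterations;
    }

    return nIterations;
}
```

For the 2x2 system with rows (4, 1) and (1, 3) and right hand side (1, 2), starting from psi = (0, 0), the sketch should approach the exact solution (1/11, 7/11).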

The End ^_^
