
Dataparallel Navier-Stokes solutions on

different multiprocessors

T. Michl^a, S. Maier^a, S. Wagner^a, M. Lenke^b, A. Bode^b

^a Institut für Aerodynamik und Gasdynamik,
Universität Stuttgart, Pfaffenwaldring 21, W-7000
Stuttgart 80, Germany

^b Lehrstuhl für Rechnertechnik und
Rechnerorganisation, TU München, Arcisstraße

ABSTRACT

The work presented here shows an efficient parallel implementation of a 3-D Navier-Stokes solver. The code is easily transferable. The numerics remained fully vectorized, enabling vector as well as parallel runs on multiple computer systems. The implicit solver works blockwise. This technique saves memory in a sequential run and enables high speed-ups in a parallel run. Parallelization remains coarse-grained for any number of nodes.

INTRODUCTION

Present research and development in aerospace science reveal a strong need for more computing power and, not least, more memory. Traditional vector supercomputers increasingly run into physical limitations and are increasingly expensive. A different approach to obtaining more computing power and storage has been the development of multiprocessor systems with powerful RISC processors. This technology promises scalable computing power and memory at moderate prices.

This paper deals with the parallelization of an industrial 3-D Navier-Stokes solver on multiprocessors of different architectures and programming models. With the development of general message passing calls, which are realized on the particular multiprocessor through the available communication model, the code is easily transferable. Thus the code is as independent as possible from the machine architecture, which is important if one considers the dynamic market situation in that field. All loops and the numerics remained fully vectorizable, which guarantees good performance on vector


computers as well. Therefore the code can demonstrate the properties of multiprocessors in comparison to vector supercomputers for certain applications.

Characteristics and performance of the code on different multiprocessor systems as well as on today's vector supercomputers are demonstrated for a transonic viscous flow around a profile and the hypersonic inviscid flow around a 3-D conus.

EULER AND NAVIER-STOKES SOLVER

The basic solver used here to simulate compressible inviscid and viscous flows is the NSFLEX (Navier-Stokes solver using characteristic FLux EXtrapolation) solver [4, 5, 10] developed by DASA-LM (Branch: Theoretical Aerodynamics - Numerical Methods). For computing inviscid flows, the viscous terms of the solver can be switched off.

The code has been revised and modified by the authors.

Governing Equations

The basic equations are the time-dependent Reynolds-averaged compressible Navier-Stokes equations. Conservation laws are used with body-fitted arbitrary coordinates ξ, η, ζ building a structured finite-volume grid with cartesian velocity components u, v, w:

$$\frac{\partial U}{\partial t} + \frac{\partial \hat{E}}{\partial \xi} + \frac{\partial \hat{F}}{\partial \eta} + \frac{\partial \hat{G}}{\partial \zeta} = 0, \qquad (1)$$

where the solution vector U is defined as:

$$U = J\,(\rho,\ \rho u,\ \rho v,\ \rho w,\ e)^T,$$

J being the cell volume. The fluxes normal to the ξ, η, ζ faces are:

$$\hat{E} = J\,(\xi_x E + \xi_y F + \xi_z G), \quad \hat{F} = J\,(\eta_x E + \eta_y F + \eta_z G), \quad \hat{G} = J\,(\zeta_x E + \zeta_y F + \zeta_z G). \qquad (2)$$

E, F, G are the cartesian flux vectors in the x, y, z directions:

$$E = \begin{pmatrix} \rho u \\ \rho u^2 - \sigma_{xx} \\ \rho u v - \sigma_{xy} \\ \rho u w - \sigma_{xz} \\ u e - u\sigma_{xx} - v\sigma_{xy} - w\sigma_{xz} + q_x \end{pmatrix}, \quad
F = \begin{pmatrix} \rho v \\ \rho v u - \sigma_{yx} \\ \rho v^2 - \sigma_{yy} \\ \rho v w - \sigma_{yz} \\ v e - u\sigma_{yx} - v\sigma_{yy} - w\sigma_{yz} + q_y \end{pmatrix}, \quad
G = \begin{pmatrix} \rho w \\ \rho w u - \sigma_{zx} \\ \rho w v - \sigma_{zy} \\ \rho w^2 - \sigma_{zz} \\ w e - u\sigma_{zx} - v\sigma_{zy} - w\sigma_{zz} + q_z \end{pmatrix}. \qquad (3)$$

The stress tensor is defined as:

$$\sigma_{xx} = -p + 2\mu\,u_x - \tfrac{2}{3}\mu\,(u_x + v_y + w_z) \quad \text{(and analogously for } \sigma_{yy}, \sigma_{zz}\text{)}, \qquad (4)$$

$$\sigma_{xy} = \sigma_{yx} = \mu\,(u_y + v_x), \quad \sigma_{yz} = \sigma_{zy} = \mu\,(v_z + w_y), \quad \sigma_{xz} = \sigma_{zx} = \mu\,(w_x + u_z), \qquad (5)$$

and the heat flux vector as:

$$q_x = -k\,T_x, \quad q_y = -k\,T_y, \quad q_z = -k\,T_z. \qquad (6)$$

The letters ρ, p, T, μ, k denote density, static pressure, temperature, viscosity coefficient and heat conductivity coefficient, respectively. The indices ()_ξ, ()_η, ()_ζ, ()_x, ()_y, ()_z denote partial derivatives with respect to ξ, η, ζ, x, y, z, except for the stress tensor σ and the heat flux vector q. In case of real-gas calculations, effective transport coefficients are introduced with the Boussinesq approximation. In case of turbulent flow, the algebraic turbulence model of Baldwin & Lomax [3] is employed.

Implicit Numerical Method

NSFLEX uses an implicit procedure for time integration [10]. The first-order time discretization of equation (1) leads to

$$\frac{U^{n+1} - U^n}{\Delta t} + \hat{E}^{n+1}_\xi + \hat{F}^{n+1}_\eta + \hat{G}^{n+1}_\zeta = 0. \qquad (7)$$

Linearizing the fluxes in equation (7) by the Newton method about the known time level n,

$$\hat{E}^{n+1} \approx \hat{E}^n + A^n\,\Delta U, \quad \hat{F}^{n+1} \approx \hat{F}^n + B^n\,\Delta U, \quad \hat{G}^{n+1} \approx \hat{G}^n + C^n\,\Delta U, \qquad \Delta U = U^{n+1} - U^n, \qquad (8)$$


where A, B, C are the Jacobians of the fluxes Ê, F̂, Ĝ, leads to:

$$\frac{\Delta U}{\Delta t} + \left(A^n \Delta U\right)_\xi + \left(B^n \Delta U\right)_\eta + \left(C^n \Delta U\right)_\zeta = -RHS^n, \qquad RHS^n = \hat{E}^n_\xi + \hat{F}^n_\eta + \hat{G}^n_\zeta. \qquad (9)$$

The divergence of characteristically extrapolated fluxes on the right-hand side (RHS) can be split into a viscous and an inviscid part. The inviscid part is computed by mixing a Riemann solver (two versions are available, one is more robust [5] whereas the other converges faster [4]) and a modified Steger & Warming scheme [10], or alternatively a modified van Leer scheme [6] that converges much faster for supersonic and hypersonic flows. Equation (9) has to be solved approximately at every time step. This is done by a block red-black Gauß-Seidel algorithm. The terms (A^n ΔU)_ξ, (B^n ΔU)_η, (C^n ΔU)_ζ on the left-hand side of equation (9) are discretized at i, j, k up to second order in space.

The obtained terms can be combined into a system of equations:

$$DIAG \cdot \Delta U^{\mu+1}_{i,j,k} = -\omega\,RHS^n_{i,j,k} + ODIAG^\mu, \qquad (10)$$

where DIAG is a 5 × 5 matrix of the sum of the eigenvalue-split inviscid and viscous thin-layer Jacobians together with 1/Δt, with ΔU_{i,j,k} being the factor it acts on. The vector ODIAG is a function of ΔU_{i+1,j,k}, ΔU_{i-1,j,k}, ΔU_{i,j+1,k}, ΔU_{i,j-1,k}, ΔU_{i,j,k+1}, ΔU_{i,j,k-1}, where the current ΔU values of the μ-iteration are taken. ω is an underrelaxation parameter.
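To make the structure of the relaxation concrete, the following sketch shows one block red-black Gauß-Seidel sweep for a system of the form of equation (10). It is an illustration only, not the NSFLEX implementation: the per-cell 5 × 5 DIAG blocks, the neighbour coupling OFFDIAG and all variable names are assumptions of this sketch.

```python
import numpy as np

def redblack_sweep(dU, DIAG, OFFDIAG, RHS, omega):
    """One red-black Gauss-Seidel sweep for DIAG @ dU = -omega*RHS + ODIAG(dU).

    dU, RHS have shape (ni, nj, nk, 5); DIAG, OFFDIAG have shape (ni, nj, nk, 5, 5).
    """
    ni, nj, nk, _ = dU.shape
    for colour in (0, 1):                      # red cells first, then black cells
        for i in range(1, ni - 1):
            for j in range(1, nj - 1):
                for k in range(1, nk - 1):
                    if (i + j + k) % 2 != colour:
                        continue
                    # neighbour contribution, using the freshest dU values available
                    odiag = OFFDIAG[i, j, k] @ (dU[i + 1, j, k] + dU[i - 1, j, k] +
                                                dU[i, j + 1, k] + dU[i, j - 1, k] +
                                                dU[i, j, k + 1] + dU[i, j, k - 1])
                    rhs = -omega * RHS[i, j, k] + odiag
                    # invert the local 5x5 block, cf. equation (10)
                    dU[i, j, k] = np.linalg.solve(DIAG[i, j, k], rhs)
    return dU
```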

PARALLEL IMPLEMENTATION

Strategy

The main objective for the development of the present parallel solver has been simple transferability. It should be possible to run the same solver on parallel systems and on traditional supercomputers, with high efficiency on both kinds of systems. Thus the solver should be fully vectorizable and should exchange as little data as possible in order to achieve good efficiencies on systems with slow networks as well. Above all, the numerical efficiency should not decrease.

The multiblock strategy, where the grid is decomposed into blocks, satisfies these specifications. The implicit solver runs independently on each block. The update of the overlapping boundaries is done after each time step. For large problems, this technique is also convenient for a sequential run, since all data fields for the implicit part must only be allocated for the largest block and not for the whole problem. Therefore, a lot of memory can be saved.

A cubix programming model has been chosen, where the same program runs


on each node. A host program is only necessary if a special distribution of programs to nodes is wanted or if instance numbers are not already fixed when the program is loaded. Otherwise each node program identifies its block number with an instance number. Interconnections to other blocks are given with the grid. Each block analyses the interconnections and dynamically builds up the communication requirements for the given problem.

An interface translating general send and receive statements into machine-specific message passing calls has been developed. This interface is the only machine-specific part of the solver.

More detailed information about the parallel strategy and the characteristics of the solver can be obtained from [7] and [8].

Mailbox Communication Model

Two routines handle the communication. The first one scans the overlapping boundaries of its domain and creates a 1-D receive table. This table contains all information needed for communication, such as the data expected from neighbouring blocks and their correlation with the local block. This information is packed into demands that are sent to the corresponding processes. With these demands every process creates a 1-D send table. These tables allow a simple and safe control of all messages that are to be sent or received. The second routine sends and receives all messages during the iterative process.

Both routines use general receive and send calls which are the same for sequential and parallel runs. Differences are introduced in the interface, which translates send and receive calls into copy statements (in a sequential run) or message passing statements (in a parallel run).
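As an illustration of this translation layer (a minimal sketch, not the authors' interface; the class, its method names and the copy-based backend are assumptions of this sketch), the generic calls can be thought of as follows; in a parallel run the bodies of send and receive would instead map onto the machine-specific calls (NX/860, PVM, Vertex, ...).

```python
import numpy as np

class Mailbox:
    """Generic send/receive interface, here backed by plain array copies,
    i.e. the 'sequential run' translation described above."""

    def __init__(self):
        self._slots = {}                       # (src_block, dst_block, tag) -> data

    def send(self, src, dst, tag, data):
        # A parallel build would issue a machine-specific send call here.
        self._slots[(src, dst, tag)] = np.array(data, copy=True)

    def receive(self, src, dst, tag):
        # A parallel build would issue a machine-specific receive call here.
        return self._slots.pop((src, dst, tag))

# Example: block 1 passes one overlapping cell row of conservative variables to block 2.
mb = Mailbox()
mb.send(src=1, dst=2, tag=0, data=np.arange(5.0))
ghost_row = mb.receive(src=1, dst=2, tag=0)
print(ghost_row)
```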

The developed mailbox model allows the computation of any blockstructured grid, such as split C-(C-)type, H-(H-)type, or Lego-type grids. The I, J, K orientation of the blocks is arbitrary. Several independent connections between two blocks are possible. Hence, one can say the solver works on structured grids but unstructured blocks.

The exchanged messages are the conservative solution vectors and their differences at the overlapping cells.

MODEL PROBLEMS

Transonic Flow around a Profile

The viscous flow around a CAST 7 profile was computed with the specification given in table (1).

The C-type Navier-Stokes grid has 254 × 64 active cells. For the calculation on parallel machines the grid is split into 4, 8, 16, 32 or 66 blocks of approximately equal numbers of grid cells. The grid is split along lines in the I-direction (figure 1).


Figure 1: Blockstructured CAST 7 Grids (Details); panels show one, four, eight and sixteen blocks.

Table 1: Specification for Viscous Transonic Flow Computation around a CAST 7 Profile.

Flow type:                              viscous
Mach number:                            0.70
Reynolds number:                        6.2 · 10^6
Angle of attack:                        0.0°
Transition laminar-turbulent (at x/L):  0.07


Figure (2) shows the Cp (experiment: [2]) and Mach number distributions computed on a 16 block grid on the Intel Paragon XP/S 5. The block-boundary interchange is smooth.

There is no significant difference in the number of iterations between the different block counts. Also, no instabilities have been observed when an overlapping block boundary lies close to or within the shock. In the presented 16 block grid the shock lies close to the cell face of the first I-cell row of the 12th block.


Figure 2: Viscous Flowfield for CAST 7; Cp and Mach number distributions, comparing the experiment [2] with the Steger & Warming and van Leer schemes.

The computation was done with the modified van Leer scheme, which tends to be unstable in transonic flow but computes the shock location very accurately in this example. The use of the modified Steger & Warming scheme leads to a shock location which lies somewhat behind the experiment, but the scheme is more stable for this kind of flow condition. The Cp plateau in the supersonic region is a bit higher for the Steger & Warming scheme and closer to the experiment.

Hypersonic Flow around a 15° Conus

Table 2: Specification for Inviscid Hypersonic Flow Computation around a 15° Conus.

Flow type:        inviscid
Mach number:      10.0
Angle of attack:  10.0°
Angle of yaw:     0.0°

3-D message passing is evaluated with the flow around a 15° conus at hypersonic speed. The specification is given in table (2).

This problem demonstrates message passing between arbitrarily oriented blocks (figure 3). The given H-H type grid was not especially created for the computation on a multiprocessor. It only followed the demands of fast predesign Euler solvers, but it is also very suitable for this more accurate solver. The splitting of this particular grid into a blockstructured grid is done by a small tool which divides H-H type grids into 6 overlapping blocks. Due to algorithm properties, each block is oriented with the K-axis orthogonal to the surface. That way each block is oriented differently in space. Multiple connections between two blocks also occur.
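One way to picture such a connection (purely illustrative; the descriptor below and all its field names are assumptions of this sketch, not the solver's actual tables) is a small record that pairs overlapping index ranges of two blocks and stores how the remote axes map onto the local ones:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class BlockConnection:
    local_block: int                           # instance number of this block
    remote_block: int                          # instance number of the neighbour
    local_range: Tuple[slice, slice, slice]    # overlapping cells on this block
    remote_range: Tuple[slice, slice, slice]   # matching cells on the neighbour
    axis_map: Tuple[int, int, int]             # which remote axis feeds each local axis
    axis_sign: Tuple[int, int, int]            # +1/-1 if a remote axis runs backwards

# Example: block 3 receives one K-layer from block 5 whose I and J axes are swapped
# and whose K axis runs in the opposite direction.
conn = BlockConnection(
    local_block=3, remote_block=5,
    local_range=(slice(0, 32), slice(0, 16), slice(0, 1)),
    remote_range=(slice(0, 16), slice(0, 32), slice(15, 16)),
    axis_map=(1, 0, 2), axis_sign=(1, 1, -1),
)
print(conn.axis_map)
```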


Figure 3: Flow around a 15° Conus in Hypersonic Speed.

Flowfield results for that problem can be seen in figure (3). The shock locations agree very well with experiments [9].

Block-boundary interchanges are smooth, apart from the Jost line effects on the surface. This, however, is a grid property and not an effect of message passing.

Computation was done in parallel with the modified van Leer flux vector splitting, which is convenient for hypersonic flows because the shock location is computed faster than with the Steger & Warming scheme.

PERFORMANCE

Systems

The vector and multiprocessor systems used and their peak performances can be seen in table (3).

CAST 7

Figure (4) shows the required CPU-time for 500 iterations on CAST 7 (the whole problem takes about 4000 iterations). One can see that already relatively small parallel systems are able to achieve the performance of traditional supercomputers. Comparing the Intel iPSC/860 with its successor, the Intel Paragon, improved runtime and compiler systems are obvious.


Table 3: Systems.

                                 Cray 2          Cray Y-MP C90    NEC SX3-44
Type:                            Vector          Vector           Vector
Nodes:                           4               16 (1 used)      4 (1 used)
Peak-performance per node:       488 MFLOPS      1 000 MFLOPS     5 500 MFLOPS
Peak-performance of the system:  1 952 MFLOPS    16 000 MFLOPS    22 000 MFLOPS

                                 IBM RS/6000/550 Cluster   Intel iPSC/860                    Intel Paragon XP/S 5
Type:                            Parallel, Distributed Memory
Nodes:                           5 (+ 1 Host)              32                                72
Peak-performance per node:       83 MFLOPS                 80 MFLOPS (Double Precision)      100 MFLOPS
Peak-performance of the system:  415 MFLOPS                1 920 MFLOPS (Single Precision)   7 200 MFLOPS
Communication:                   PVM                       NX/860                            NX/860; OSF/1

                                 Alliant FX/2800                 nCUBE 2
Type:                            Parallel, Shared Memory         Parallel, Distributed Memory
Nodes:                           8 (+ 1 I/O)                     64
Peak-performance per node:       80 MFLOPS (Double Precision)    3.4 MFLOPS (32-bit)
Peak-performance of the system:  540 MFLOPS (Single Precision)   217.6 MFLOPS (32-bit)
Communication:                   Shared Region                   Vertex

At the present state, though, the Intel Paragon is not able to achieve the performance of the NEC, although the theoretical performance of the Paragon is a bit higher. The performance on the Paragon might improve once the system finally uses all hardware and software options, such as software pipelining, the microkernel, and so on. One must also take into account that some time-consuming routines of the NEC code have been especially adapted to this machine (benchmark code).

Characteristics and efficiency of a parallel implementation can be seen in speed-up plots (figure 5). The speed-up is defined as:

$$ su = \frac{\text{CPU-time of the sequential run}}{\text{CPU-time of the parallel run}}. \qquad (11) $$

Figure 4: Required CPU-time for 500 Iterations on CAST 7; a) all machines, b) fastest machines. CPU-time is plotted against the number of nodes/blocks for the Cray 2, Cray Y-MP C90, NEC SX3-44, Alliant FX/2800, Intel iPSC/860, Intel Paragon XP/S 5 and the IBM RS/6000/550 Cluster.

An exception to this rule must be made for the PVM version on the IBM cluster. PVM hands the communication to a daemon whose CPU-time is not part of the node program's CPU-time. This would lead to incorrect speed-ups. Therefore, elapsed times are shown for the PVM versions.

Figure 5: Speed-Up on Multiprocessors for CAST 7; a) all machines, b) detail. Speed-up is plotted against the number of blocks/nodes for the Cray 2, Intel iPSC/860, Alliant FX/2800 (semaphores and lockwaits), Intel Paragon XP/S 5 and the IBM RS/6000/550 Cluster (Tokenring, Ethernet and SOC).


On the Alliant FX/2800 two synchronization techniques have been implemented. One is based on lockwaits, which put a process into a busy-waiting state; this technique is recommended by Alliant. The other technique is based on semaphores, which put a process into a sleeping state where it does not use CPU-time. The number of semaphores, however, is limited by the operating system. This restriction can be managed if synchronization is handled by only one process, which acts as a kind of master. Figure (5) shows that this technique is more efficient.

The Alliant FX/2800 is a shared memory multiprocessor where the processes access memory via a crossbar interconnect and a global cache. With an increasing number of processes, the use of the global cache becomes more and more ineffective. This leads to a decrease in parallel efficiency. The 16 and 32 block runs are done on an 8 node machine (multitasking).

Altogether different is the behaviour of the code on the IBM RS/6000/550 Cluster, where superlinear speed-ups are achieved. This seems to be a contradiction to Amdahl's law:

$$ su = \frac{T_{sequential}}{T_{communication} + \dfrac{T_{computation}}{P}}, \qquad (12) $$

where P is the number of compute nodes. Assuming the communication time to be zero and the sum of the computation times of all processes to be equal to the sequential execution time (which means no duplicated computations), the maximum speed-up could be linear. But Amdahl's law does not account for certain algorithm properties and cache effects.

Figure 6: Logical View of the IBM RISC System/6000/550 Architecture; the branch processing unit, fixed point unit and floating point unit are connected to main memory through a 64 KB data cache, with an instruction cache unit and I/O registers and devices attached.

On the IBM RS/6000, data is shuttled between main memory and the processing units through a data cache (figure 6). A data cache miss means


a delay of about 8 cycles [1], whereas data in the cache can be accessed within one cycle. Keeping in mind that a simultaneous add and multiply instruction takes one cycle, a data cache miss can decrease the performance significantly. On a cluster of IBM RS/6000 workstations, every additional node enlarges the cache of the whole system while the overall problem size remains about the same. Therefore the probability of cache misses decreases significantly. The problem size of CAST 7 is about 4 MByte, while the data cache on the 8 node system is 512 KByte.

The influence of data cache misses on the performance was also evaluated with a simple matrix-matrix multiplication. The results achieved on the IBM RS/6000/550 are plotted in figure (7). The performance stays at a constantly high level up to a matrix size of 200, corresponding to a problem size of 960 KByte. The increasing number of data cache misses then leads to a dramatic decrease in performance, which approaches a constant MFLOP rate from a matrix size of about 350.
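A rough sketch of such a benchmark is given below (illustrative only; the loop structure, the chosen matrix sizes and all names are assumptions of this sketch, and the absolute numbers on a modern machine will of course differ from the 1993 measurements). The idea is simply to time an n × n multiplication for growing n and relate the MFLOP rate to the working-set size:

```python
import time
import numpy as np

def matmul_mflops(n, reps=3):
    """Time an n x n matrix-matrix multiplication and report MFLOPS."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        c = a @ b                               # about 2*n**3 floating-point operations
        best = min(best, time.perf_counter() - t0)
    working_set_kb = 3 * n * n * 8 / 1024.0     # three double-precision matrices
    return 2.0 * n ** 3 / best / 1.0e6, working_set_kb

for n in (100, 200, 300, 400):
    mflops, kb = matmul_mflops(n)
    print(f"n = {n:4d}   working set {kb:8.1f} KByte   {mflops:10.1f} MFLOPS")
```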

Additionally, our parallel implementation exchanges rather little data, with correspondingly short transfer times, compared to the actual computation. Another algorithm property, described below for the Intel iPSC/860 runs, might further promote superlinear speed-ups. Given these properties, message passing does not noticeably affect the performance. Figure (5) also demonstrates the influence of the bandwidth of the interconnections between the nodes. The lowest speed-ups on the IBM cluster are achieved with an Ethernet (having a transfer rate of 10 Mbit/s), the best with SOC (Serial Optical Channels, having a transfer rate of 220 Mbit/s). Taking into account that an Ethernet can handle only one data package at a time, whereas message passing between all nodes is usually done synchronously, an Ethernet must be worse since each process is delayed. A Tokenring can also handle only one data package at a time, but it is ensured that this data package will be transmitted successfully once the line is acquired. An Ethernet allows all processes to access the line at the same time, but if that happens, all processes that accessed the line interrupt their data transfer and try again somewhat later. Thus data transfer is not as efficient as with a Tokenring. Additionally, the bandwidth of a Tokenring is higher (16 Mbit/s). SOC can be seen as virtual, fully intermeshed connections. Hence, SOC do not suffer from the restrictions of an Ethernet or a Tokenring and exchange data much faster.

First tests on the Intel Paragon show superlinear speed-up, too. In figure (5) the speed-up is referenced to the 16 node run. The 66 node run would lead to a speed-up of about 69. One should keep in mind, however, that at the present state the whole operating system (OSF/1 T 8) is running on every node, so that not much memory is left for the application processes. The system is paging for sequential and mildly parallel runs. Cache effects may also support improved performance as more processors are introduced.


Figure 7: Performance of a Matrix-Matrix Multiplication on the IBM RS/6000/550 (Double Precision); MFLOPS plotted against the matrix dimension (up to 400).

Speed-up measurements will therefore be more reliable once the microkernel is shipped.

Some more details might give a better view of the Paragon's performance. The grid, for example, has not especially been created for that run. Dividing the 254 active cell rows in I into 66 blocks creates stripes which are 3 or 4 cell rows wide. Thus, due to block imbalances, and neglecting cache effects and delays from message passing, the expected speed-up could not be greater than about 63. The 66 node run is working only with sequentialized I/O, where all the I/O is handled by one service node. Therefore the I/O of such a small problem (550 [s] CPU-time) takes about 20 % of the time. Another 20 % of the time is spent on synchronization, which leads to a total of 35 % of the CPU-time being spent with message passing. Those effects might improve, though, with new releases of the operating system.
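One way to arrive at this bound (a back-of-the-envelope estimate, assuming the slowest blocks carry the widest stripes of 4 of the 254 cell rows) is:

$$ su_{\max} \approx \frac{254}{\lceil 254/66 \rceil} = \frac{254}{4} = 63.5 \approx 63. $$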

The runs on the Intel iPSC/860 were made with a synchronous communication routine and not with the asynchronous one of the Paragon. With synchronous communication more time is spent in message passing. On the Intel iPSC/860 the Portland compiler version 2 has been used, while on the Intel Paragon version 4 is installed. Recent tests with the version 4 compiler on the Intel iPSC/860 reduced the CPU-time by a factor of 2 to 3. The speed-up is referred to the 4 node run; due to a lack of memory, a sequential run is not possible. Speed-up plots again show superlinearity. One reason might be that the time spent for message passing decreases with an increasing number of nodes (4 nodes: 952 [s], 8 nodes: 543 [s], 16 nodes: 234 [s], 32 nodes: 205 [s]), although the amount of data exchanged by each block is about the same. This effect can be explained by an algorithm property. The solver starts working with 5 cell rows in the k-direction and expands the domain it is working on when a significant change in the last k-cell row is detected. For a relatively long time this is not true for blocks lying far away


from the profile (figure 1). These blocks compute less and then wait at semaphores for data transfer. Thus, blocks which do more computations are obstructed less. It is also remarkable that I/O is a lot faster on the Intel iPSC/860 than on the Paragon.

15° Conus

A parallel computation for that flow was only done on the IBM RS/6000/550 Cluster, because it has enough memory for a one-block run.

As described above, the given H-H type grid has been split into 6 blocks, of which the biggest has 83 950 active cells and the smallest 38 500. The total number of cells is 272 976. Thus, the maximum theoretical speed-up due to block imbalances (neglecting cache effects) is about 3.25.
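This bound follows directly from the quoted block sizes, since the largest block sets the pace:

$$ su_{\max} = \frac{\text{total cells}}{\text{largest block}} = \frac{272\,976}{83\,950} \approx 3.25. $$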

500 iterations take about 64 850 [s] in the sequential run. The parallel run on the cluster, interconnected by an Ethernet, took 19 100 [s]. Hence, the speed-up is 3.40. Superlinearity is again a result of cache effects.

This has also been confirmed with a 3-D computation on a half sphere, where the grid was scaled with the number of processors in such a way that each processor works on the same problem size. That test leads to the expected sublinear speed-up due to message passing [8].

CONCLUSION

The parallel implementation has proved its transferability to any computer system. The multiblock strategy and the developed mailbox programming model allow the computation of any blockstructured 2-D or 3-D grid. The orientation of the blocks is arbitrary, as is the number of interconnections.

Good performance has been achieved on vector supercomputers and parallel systems alike. Benchmarks with a 2-D Navier-Stokes computation showed that even small parallel configurations are able to achieve execution times comparable to today's vector supercomputers.

Numerical efficiency was preserved, too. Splitting the domain into blocks and executing the red-black Gauß-Seidel solver independently on each block has no remarkable effect on the convergence rate.

Future research will now focus on the acceleration of the numerics itself.

ACKNOWLEDGEMENT

This work is supported by the DFG (Deutsche Forschungsgemeinschaft) (Wa 424/9, Bo 818/2).

Thanks also to RUS, Stuttgart, and IBM, Heidelberg, who enabled some benchmarking for us.

The conus grid has been created by Roland Losch (IAG, Stuttgart).


References

[1] AIX Version 3.2: Optimization and Tuning Guide for the XL FORTRAN and XL C Compilers. IBM Canada Ltd. Laboratory, North York, Ontario, 1992.

[2] J. P. Archambaud, A. Mignosi, and A. Seraudie. Rapport d'essais sur profil CAST 7 effectués à la soufflerie T2 en présence de parois auto-adaptables en liaison avec le groupe GARTEUR AG (AG02). Rapport Technique OA no. 24/3075 (DERAT no. 7/5015 DN), O.N.E.R.A., Toulouse Cedex, 1982.

[3] B. S. Baldwin and H. Lomax. Thin Layer Approximation and Algebraic Model for Separated Turbulent Flow. AIAA-Paper 78-0257, 1978.

[4] A. Eberle. Enhanced Numerical Inviscid and Viscous Fluxes for Cell Centered Finite Volume Schemes. MBB-LKE-S-PUB-140, 1991.

[5] A. Eberle and M. A. Schmatz. High Order Solutions of the Euler Equations by Characteristic Flux Averaging. ICAS-86-1.3.1, 1986.

[6] D. Hänel and R. Schwane. An Implicit Flux Vector Splitting Scheme for the Computation of Viscous Hypersonic Flows. AIAA 89-0274, 27th Aerospace Sciences Meeting, Reno, Nevada, 1989.

[7] M. Lenke, A. Bode, T. Michl, and S. Wagner. Implicit Euler Solver on Alliant FX/2800 and Intel iPSC/860 Multiprocessors. In: Notes on Numerical Fluid Mechanics, Vol. 38 (Ed.: E. H. Hirschel), Vieweg, Braunschweig, 1993.

[8] T. Michl, S. Maier, S. Wagner, M. Lenke, and A. Bode. A Data Parallel Implicit 3-D Navier-Stokes Solver on Different Multiprocessors. Paper presented at "Parallel CFD '93", Paris, to appear.

[9] J. V. Rakich and J. W. Cleary. Theoretical and Experimental Study of Hypersonic Steady Flow around Inclined Bodies of Revolution. AIAA Journal, Vol. 8, No. 3, March 1970.

[10] M. A. Schmatz. Three Dimensional Viscous Flow Simulation Using an Implicit Relaxation Scheme. MBB-LKE-S-PUB-309, 1987.
