
Direct Numerical Simulation of Turbulence with a PC/Linux Cluster: Fact or Fiction?

G-S Karamanos, C. Evangelinos, R.C. Boes, R.M. Kirby and G.E. Karniadakis∗

Division of Applied Mathematics

Brown University, Providence, RI 02912, USA

Abstract

Direct Numerical Simulation (DNS) of turbulence requires many CPU days and Gigabytes of memory. These requirements limit most DNS to supercomputers, available at supercomputer centres. With the rapid development and low cost of PCs, PC clusters are evaluated as a viable low-cost option for scientific computing. Both low-end and high-end PC clusters, ranging from 2 to 128 processors, are compared to a range of existing supercomputers, such as IBM SP nodes, the Silicon Graphics Origin 2000, the Fujitsu AP3000 and the Cray T3E. The comparison concentrates on CPU and communication performance. At the kernel level, BLAS libraries are used for CPU performance evaluation. Regarding communication, the free implementations of MPICH and LAM are used on fast-ethernet-based systems and compared to myrinet-based and supercomputer networks. At the application level, serial and parallel simulations are performed on state-of-the-art DNS, such as turbulent wake flows in stationary and moving computational domains.

1 Introduction

1.1 Historic Perspective

Direct Numerical Simulation (DNS) computes flow fields, resolving all scales from first principles, without any ad hoc modelling assumptions. It was initiated by Orszag and Patterson (1972), who obtained accurate simulations of wind-tunnel flows at moderate Reynolds numbers. This first simulation was performed on the CDC 7600, with limited memory and 50 Mflop/s peak speed. Since that time, DNS of channel flows, riblets, cylinder flows and flexible cylinders has been performed, providing insights unavailable and in some cases impossible to obtain with experimental techniques. The bane of DNS, though, is its high computational cost. The Navier-Stokes equations are three-dimensional, time-dependent and non-linear partial differential equations that have to be solved at all scales for which there is appreciable kinetic energy. It has been shown that as the Reynolds number, Re, increases, the number of degrees of freedom necessary to accurately simulate a fully turbulent flow increases as Re^{9/4}, while the total cost of the computation increases as Re^3 (Karniadakis & Orszag 1993). Typically, turbulent flows correspond to Re greater than 1000, with geophysical flows at Re > 10^5. For DNS of flow past an airfoil at moderate Reynolds numbers O(10^8), a teraflop computer would be required, while DNS of a complete aircraft would require the use of an exaflop (10^18) computer.

A review of parallel DNS was presented in 1993 (Karniadakis & Orszag 1993), along with a proposal for future developments in DNS: the design of a parallel prototype computer (PPC) with a hybrid distributed/shared memory of 1000 processors, achieving 1 Teraflop for a load-balanced turbulence simulation at 1024^3 resolution. The main concept of this proposal has been realised as part of the developments of the Advanced Strategic Computing Initiative supported by the Department of Energy. The HP/Convex Exemplar, the SGI Origin 2000, and the new IBM SP2 have features similar to the PPC, providing high efficiency in DNS. With target simulations of billions of grid points on 100 Teraflop computer systems achievable in the next couple of years, the current focus is thus turning towards the next generation of Petaflop (10^15 flops) systems.

∗ Author for correspondence

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SC '99, Portland, OR (c) 1999 ACM 1-58113-091-8/99/0011 $3.50

Despite these developments, current DNS is limited to low Reynolds numbers, depending on the geometry simulated. In simple geometries, the highest-Re DNS computed is for Re = 12500 (Moser, Kim & Mansour 1999), while cylinder flow DNS has been computed for Reynolds numbers as high as 4000. To put things into perspective, the latter simulation (Xia, Karamanos & Karniadakis 1999), performed at the National Center for Supercomputing Applications, used 128 processors, a total of 30 GBytes of memory and 100 million degrees of freedom (d.o.f.). It required 250 hours of CPU time per processor, i.e. 32,000 hours in total, or 3.6 CPU years. Smaller simulations have been done, though they have been computed for lower Reynolds numbers on the order of 1000, and mostly in simply-connected geometries.

DNS is performed on supercomputers available at supercomputing centres, such as the Maui High Performance Computing Center (MHPCC), the National Center for Supercomputing Applications (NCSA), and the Major Shared Resource Center of the Naval Oceanographic Office (NAVO). Due to the high cost of ownership, it is unlikely that individual institutions will maintain state-of-the-art supercomputers, relying instead on the facilities of national and/or state centers. A new trend, based on the low cost and high performance of PCs, is emerging. Using various operating systems, clusters of PCs may be assembled using off-the-shelf components (Becker et al. 1995, Reschke et al. 1996, Ridge et al. 1997). The cost of these clusters is low compared to that of supercomputers, and initial tests indicate that PC clusters may demonstrate acceptable performance. Currently two main operating systems are popular for PCs: Linux and Microsoft Windows NT. Apart from experimental NT clusters (NCSA, Cornell Theory Center), the majority of emerging PC clusters use Linux due to the availability of the Linux source code, good system stability, ease of remote system management and compatibility with other UNIX systems.

1.2 Objective

The objective of this paper is to investigate the applicability and practicality of using a PC/Linux cluster (denoted as PC cluster in this paper) for DNS of turbulence. A low-budget (less than $10,000) 4-PC cluster is compared to high-end Linux clusters, such as the RoadRunner PC/Linux cluster at the Albuquerque High Performance Computing Center (AHPCC), and to some widely used supercomputers. The investigation will concentrate on two levels:

• Kernel level comparisons:

– Computational kernels: BLAS routines account for most of the work in a DNS. Matrix-matrix/matrix-vector products and vector copy are tested, among others, to evaluate single node performance.

– Communication kernels: MPI functions are tested to evaluate latency/bandwidth and MPI_Alltoall performance.

• Application level comparisons:

– Single node performance: Serial algorithms are tested on the DNS of turbulent bluff body flows.

– Multiple node performance: Parallel algorithms are tested with the DNS of turbulent bluff body flows on both spatially fixed and moving geometries.


1.3 Numerical Method

A spectral/hp element numerical discretisation (Karniadakis & Sherwin 1999) will be used for the solution of the Navier-Stokes equations, using serial and parallel algorithms (Warburton 1999). For the latter, two prototype cases are available for non-separable and multiply-connected domains: the first corresponds to physical situations where one direction is homogeneous, so Fourier expansions may be used in that direction, while the non-homogeneous plane is resolved using the spectral/hp element method. The second case uses a three-dimensional spectral/hp element method and is applicable to computational domains with all directions non-homogeneous. An intrinsic element-based domain decomposition is used, where the distribution of work among processors allows for a high degree of parallelism.
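
For the first case, the fields are expanded in Fourier modes in the homogeneous direction; a sketch of the standard ansatz is given below (the conventions here, e.g. the normalisation and the spanwise wavenumber spacing β = 2π/Lz, are assumptions and may differ from those used in the code):

% Hybrid Fourier--spectral/hp ansatz (sketch): Fourier modes in the
% homogeneous z-direction, spectral/hp elements in the (x,y) plane.
u(x,y,z,t) \;=\; \sum_{k=-N_z/2}^{N_z/2-1} \hat{u}_k(x,y,t)\, e^{\mathrm{i} k \beta z},
\qquad \beta = \frac{2\pi}{L_z}

Each Fourier coefficient û_k(x,y,t) is then represented in the non-homogeneous plane by the spectral/hp element expansion, and the modes can be distributed across processors.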

The application level tests will be based on NεκTαr, which is a general purpose CFD code for simulating incompressible, compressible and plasma flows in unsteady three-dimensional geometries. The major algorithmic developments are described in Sherwin & Karniadakis (1995) and Warburton (1999). The code uses meshes similar to standard finite element and finite volume meshes, consisting of structured or unstructured grids or a combination of both. The formulation is also similar to those methods, corresponding to Galerkin and discontinuous Galerkin projections for the incompressible and compressible Navier-Stokes equations, respectively. Field variables, data and geometry are represented in terms of hierarchical (Jacobi) polynomial expansions (Sherwin & Karniadakis 1995); both iso-parametric and super-parametric representations are employed. These expansions are ordered in vertex, edge, face and interior (or bubble) modes. For the Galerkin formulation, the required C^0 continuity across elements is imposed by choosing the edge (and, in 3D, face) modes appropriately; at low order expansions this formulation reduces to the standard finite element formulation. The discontinuous Galerkin is a flux-based formulation, and all field variables have L^2 continuity; at low order this formulation reduces to the standard finite volume formulation. This new generation of Galerkin and discontinuous Galerkin spectral/hp element methods implemented in the code NεκTαr does not replace but rather extends the classical finite element and finite volume methods that CFD practitioners are familiar with (Karniadakis & Sherwin 1999). The additional advantage is that convergence of the discretization, and thus solution verification, can be obtained without remeshing (h-refinement), and that the quality of the solution does not depend on the quality of the original discretization.

2 Hardware Compared

Comparisons between 10 machines are made, namely:

1. NSF Alliance computer at the Albuquerque High Performance Computing Center (AHPCC), which is a center of UNM, denoted in this paper as RoadRunner. RoadRunner is an Alta Technology Corporation 64-node (dual-processor) AltaCluster at AHPCC containing 128 Intel 450 MHz Pentium II (512KB L2 cache) processors. Each node has 512MB of 100MHz SDRAM. The supercluster uses the Linux OS (Red Hat distribution with a 2.2.10 kernel) and the processors are interconnected via Fast Ethernet for control communications, and a Myrinet 32-bit network (∼160MB/s hardware peak capacity) for high-speed data communications.

2. Cluster of 4 PCs, denoted in this paper as Muses. Each PC has an Intel 450MHz Pentium II (512KB L2 cache) processor, 384MB of 100MHz SDRAM and, for the Fast Ethernet networking, an Adaptec Quad 10/100TX PCI ANA-6944A, Pentium II optimised. The network topology is point to point, utilizing the quad cards. The Red Hat 5.2 Linux distribution, upgraded to the 2.2.5 kernel (and associated modules), is used. The cluster cost less than $10,000 nine months ago, and with the continuing depreciation of computer hardware it is now much cheaper.


3. IBM RS/6000 SP, with Model F50 (Silver) nodes, denoted in this paper as SP2-Silver. Each node has 1GB of memory and four 604e PowerPC 332 MHz (256KB L2 cache) processors. The system is located at Brown University's Technology Center for Advanced Scientific Computation and Visualisation. It has 24 nodes, connected via an SP Switch with an MX node adapter (150MB/s hardware peak capacity).

4. IBM RS/6000 SP, with Model 39H ("Thin 2") nodes, denoted in this paper as SP2-Thin2. Each node has a single 66MHz Power2 (128KB L1 cache, 128-bit memory bus) processor and 128MB of memory. The system at Brown University's Center for Fluid Mechanics has 24 nodes connected via an IBM high performance switch with a TB2 node adapter (40MB/s hardware peak capacity).

5. IBM RS/6000 SP, with Model 397 ("Thin 4") nodes, denoted in this paper as P2SC. The system is located at the Maui High Performance Computing Center and consists of 211 nodes, connected via an SP Switch. Each node has a single 160MHz processor, with 256MB of memory.

6. SGI Onyx2 (195MHz), denoted in this paper as Onyx 2. This system is located at Brown University's Center for Fluid Mechanics and has 2GB of memory and eight 195MHz R10000 (4MB L2 cache) processors.

7. SGI Origin 2000, denoted in this paper as NCSA. This system is located at the National Center for Supercomputing Applications. It has R10000 processors at 250MHz and 195MHz, with 4MB of L2 cache.

8. Fujitsu AP3000, denoted in this paper as AP3000. The system is located at Imperial College's Computing Center. It has 28 UltraSPARC 300MHz single-processor nodes, with 256MB RAM per node, connected via a dedicated high-speed network (AP-Net) with a 200MB/s peak hardware capacity.

9. SGI/Cray T3E-900, denoted in this paper as T3E. The system is located at the Naval Oceanographic Office Major Shared Resource Center. It has 816 450MHz (Alpha 21164A-based) nodes, 128 of them with 512MB of RAM, the rest with 256MB.

10. Super Technical Server HITACHI SR8000, denoted as HITACHI. It is located at the University of Tokyo and has 128 nodes with 8 custom PA-RISC based CPUs per node, with pseudo-vector capabilities and 8GB of shared memory per node. A three-dimensional crossbar is employed for the inter-node network (1GB/s hardware capacity).

Figure 1: Speed of dcopy in MBytes/sec against array size (Bytes). Comparison of a 450MHz Pentium II machine, as used for Muses, to various supercomputers (left: SP2-Thin2, SP2-Silver, Muses, AP3000, Onyx2; right: T3E, SP2-P2SC, Muses). For the Cray T3E, the tests were run with hardware prefetching (STREAMS) enabled.


Figure 2: Speed of daxpy in Mflop/s against array size (Bytes). Comparison of a 450MHz Pentium II machine, as used for Muses, to various supercomputers (left: SP2-Thin2, SP2-Silver, Muses, AP3000, Onyx2; right: T3E, SP2-P2SC, Muses). For the Cray T3E, the tests were run with hardware prefetching (STREAMS) enabled.

3 Kernel Level Tests

3.1 CPU Tests

The use of BLAS and LAPACK (also based on BLAS) subroutines in scientific codes has become a widespread practice. BLAS routines account for most of the work in the codes presented. The most important routines are matrix-matrix products (C = αA × B + βC) dgemm(), matrix-vector products (y = Ax + y) dgemv(), vector copy (y = x) dcopy(), vector scale and add (y = αx + y) daxpy(), and inner product (s = x·y) ddot(). Comparison of BLAS performance between different machine architectures is important in assessing the applicability of each architecture to DNS. Comparisons of single CPU performance in MB/s and Mflop/s are presented by timing repeated calls to the aforementioned BLAS subroutines. Any sharp outliers seen have been repeatedly measured to ensure they are not a transient effect. The vectors/matrices used are in-cache for sizes that fit in a certain level of the cache hierarchy, producing an upper bound on performance. No effort has been made to compensate for the time spent in routine initialization, thus the figures presented correspond to the performance as seen by the user and not to algorithmic performance. For all cases the vendor-optimized BLAS library was used, i.e. Sun's LIBPERF for the AP3000, IBM's ESSL for the SPs, SGI's SCSL for the Onyx2, Cray's SCILIB for the T3E and Intel's ASCI Red free BLAS distribution for the PC/Linux cluster. Figures 1-5 compare the hardware described in the previous section (both Muses and RoadRunner use 450MHz Pentium II processors), with the general conclusion that the T3E and the SP2-P2SC nodes are superior to all the other architectures tested.
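
To make the measurement procedure concrete, a minimal sketch of this kind of kernel timing is given below. It is illustrative only (it assumes a CBLAS interface, gettimeofday() as the timer, and an arbitrary vector length and repetition count) and is not the harness actually used for the figures:

/* Minimal BLAS kernel timing sketch (illustrative, not the harness used
 * for the figures): times repeated daxpy calls on an in-cache vector and
 * reports Mflop/s, counting 2 flops per element (multiply + add).       */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

static double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void) {
    const int n = 4096;         /* vector length (fits in L2 cache)     */
    const int reps = 10000;     /* repetitions to beat timer resolution */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    cblas_daxpy(n, 0.5, x, 1, y, 1);        /* warm the cache           */
    double t0 = wall_seconds();
    for (int r = 0; r < reps; ++r)
        cblas_daxpy(n, 0.5, x, 1, y, 1);    /* y = 0.5*x + y            */
    double t1 = wall_seconds();

    printf("daxpy, n = %d: %.1f Mflop/s\n", n,
           2.0 * n * reps / (t1 - t0) / 1.0e6);
    free(x); free(y);
    return 0;
}

The same loop structure applies to the other kernels; only the call and the flop count per element change.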

Figure 3: Speed of ddot in Mflop/s against array size (Bytes). Comparison of a 450MHz Pentium II machine, as used for Muses, to various supercomputers (left: SP2-Thin2, SP2-Silver, Muses, AP3000, Onyx2; right: T3E, SP2-P2SC, Muses). For the Cray T3E, the tests were run with hardware prefetching (STREAMS) enabled.


Figure 4: Speed of dgemv in Mflop/s against array size (Bytes). Comparison of a 450MHz Pentium II machine, as used for Muses, to various supercomputers (left: SP2-Thin2, SP2-Silver, Muses, AP3000, Onyx2; right: T3E, SP2-P2SC, Muses). For the Cray T3E, the tests were run with hardware prefetching (STREAMS) enabled.

For the BLAS Level 1 routines, daxpy(), dcopy() and ddot(), the PC performance for data that fit in the first level of cache is among the best of the architectures examined (Figures 1-3). For vectors resident in L2 cache, performance is on par with or slightly faster than that of the PowerPC 604e based "Silver" node SP2 and lower than the rest, with the notable exception of ddot(). For data that need to be fetched from main memory, all kernels are memory-bandwidth bound, and the PC platform performs well due to its fast 100MHz SDRAM memory subsystem.

For the BLAS Level 2 routine dgemv() (Figure 4), a higher percentage of the peak performance is expected to be reached for in-cache measurements. This is the case for most architectures other than the PC, where the ddot() performance is actually unmatched. The performance drop in going to L2 cache is not large (for all but the "Silver" node SP), but once outside of L2 cache, performance drops significantly, as it becomes main-memory-bandwidth bound. Again, the good memory bandwidth of the PC platform allows it to be competitive with the SP "Thin2" node and the SGI Onyx2.

Figure 5: Speed of dgemm in Mflop/s against array size (Bytes). Comparison of a 450MHz Pentium II machine, as used for Muses, to various supercomputers (left: SP2-Thin2, SP2-Silver, Muses, AP3000, Onyx2; right: T3E, SP2-P2SC, Muses). For the Cray T3E, the tests were run with hardware prefetching (STREAMS) enabled.

In many computational mechanics and dense linear algebra codes, the matrix-matrix BLAS Level 3 kernel dgemm() is very important (Figures 5, 6). The maximum possible performance achievable on a given system is expected for a large enough matrix size n × n. This is because the ratio of floating point operations to memory references is O(n). Given also the fact that the PC peak (hardware, never to be exceeded) performance is 450 Mflop/s, while most of the other machines have higher peaks (up to 666 Mflop/s for the "Silver" node SP), it is not surprising that the PC performance curve is lower than that of most of the competition (Figure 5). However, most of the calls to dgemm() in the NεκTαr codes presented are for small n (10 or less). The small matrix performance of the PC platform is shown in Figure 6.
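
For completeness, the operation-to-data ratio behind this statement can be written out explicitly (a standard estimate, independent of any particular BLAS implementation):

% dgemm on n-by-n matrices: C <- alpha*A*B + beta*C
\text{flops} \approx 2n^3, \qquad
\text{data} \approx 3n^2 \ \text{matrix entries}, \qquad
\frac{\text{flops}}{\text{data}} \approx \frac{2n^3}{3n^2} = O(n)

so for sufficiently large n the kernel becomes compute bound rather than memory-bandwidth bound, which is why in-cache dgemm() approaches the hardware peak.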


Figure 6: Speed of dgemm in Mflop/s against small array size. Comparison of a 450MHz Pentium II machine, as used for Muses, to various supercomputers (left: SP2-Thin2, SP2-Silver, Muses, AP3000, Onyx2; right: T3E, SP2-P2SC, Muses). For the Cray T3E, the tests were run with hardware prefetching (STREAMS) enabled.

3.2 Communication Tests

Simple unidirectional (Ping-Pong) latency and bandwidth testing is performed with NetPIPE 2.3 (Figure 7). For the supercomputers the vendor-optimized MPI libraries are used. For the RoadRunner cluster, the MPICH version 1.1.2 release of Myricom with the GM version 1.04 communication layer is used, while the PC cluster, Muses, uses the free MPI implementations MPICH (v1.1.1) and LAM (v6.1). Performance tuning of the underlying TCP layer on Muses involved changing to a 2.2-based Linux kernel with some parameter tuning and a one-line change in the LAM low-level TCP code (John Dennis, private communication). This was possible thanks to the open source nature of both Linux and LAM.
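
The underlying measurement is the standard MPI ping-pong. A minimal sketch of it is shown below; this is not the NetPIPE code itself, and the message size and repetition count are arbitrary choices for illustration:

/* Minimal MPI ping-pong sketch (illustrative, not NetPIPE itself):
 * rank 0 sends a message to rank 1 and waits for the echo; half the
 * round-trip time gives the one-way latency/bandwidth estimate.      */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 16;             /* message size: 64 KB */
    const int reps = 1000;
    char *buf = malloc(nbytes);
    MPI_Status st;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; ++r) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * reps);
    if (rank == 0)
        printf("one-way time %g us, bandwidth %g MB/s\n",
               1.0e6 * one_way, nbytes / one_way / 1.0e6);
    free(buf);
    MPI_Finalize();
    return 0;
}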

A point-to-point connection topology is used in Muses; a larger cluster would at some point require a switched configuration, incurring a small latency penalty. The latency numbers for Muses are low enough to be competitive with some of the supercomputers. With the use of the emerging M-VIA based MPI implementations, latency is expected to drop to the sub-50 microsecond range (reported values for the underlying M-VIA (1999) implementation are 23µs). The bandwidth numbers for the PC cluster are currently limited by the Fast Ethernet peak bandwidth.

At the high end of PC clusters, both internode and intranode tests are performed on the RoadRunner cluster using both its ethernet and myrinet networks. The ethernet network produces high latency and low bandwidth, compared to Muses and the other systems, with the inter- and intra-node communications distinctly different. The inter-node myrinet network is comparable to the SP2-Silver nodes and better than the AP3000 and SP2-Thin2 with respect to latency. The bandwidth recorded, though, is lower than most systems, apart from the SP2-Thin2.

Extensive communication tests for parallel Fourier-spectral/hp Navier-Stokes solvers were performed by Evangelinos & Karniadakis (1996), indicating that MPI_Alltoall is the most communication intensive and expensive operation, straining the networks to their limit. Based on their conclusions, timings are performed for MPI_Alltoall by globally synchronising and then timing a loop calling MPI_Alltoall. The timing was done on all processors, repeated 100 times, with the statistics calculated on the sample of 100 measurements. If access to a high-resolution clock is available, that is the one used; otherwise, MPI_Wtime is used. Should the resolution be too low for small message sizes, multiple invocations of the routine are timed in the innermost loop and divided by their number, in order to obtain the performance numbers. Apart from the T3E, which is 3 times higher than the rest, the myrinet network has a slightly higher bandwidth than the IBM SP2 Thin2 nodes (inter-node for the SP2-Silver) and slightly lower than the NCSA Origin 2000. Although not shown, tests were also performed on the Super Technical Server HITACHI SR8000, which had a minimum recorded bandwidth of 450 MBytes/sec for a message size of 6,400,000 bytes.
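
A minimal sketch of this timing loop is shown below; the message size is hypothetical and the statistics are reduced to a simple average on one rank, unlike the full harness described above:

/* Minimal MPI_Alltoall timing sketch: synchronise globally, time a loop
 * of calls, and report the average time per call on rank 0.            */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int count = 16384;    /* doubles sent to each destination rank */
    const int reps = 100;
    double *sendbuf = malloc((size_t)count * nproc * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)count * nproc * sizeof *recvbuf);
    for (int i = 0; i < count * nproc; ++i) sendbuf[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);               /* global synchronisation */
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; ++r)
        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,
                     recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);
    double avg = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("MPI_Alltoall, %d procs, %zu bytes per message: %g s/call\n",
               nproc, count * sizeof(double), avg);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}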


Figure 7: Left: Ping-Pong one-way latency (µsec) against message size (bytes); Right: Ping-Pong one-way bandwidth (MB/sec) against message size (bytes). Legend, applicable to both plots: AP3000, SP2-Thin2, SP2-Silver (internode), SP2-Silver (intranode), Muses (MPICH), Muses (LAM), Onyx 2, RoadRunner (ethernet, intranode), RoadRunner (ethernet, internode), RoadRunner (myrinet, intranode), RoadRunner (myrinet, internode), T3E.

Figure 8: MPI_Alltoall: average bandwidth (MBytes/sec) against message size (Bytes) for different architectures. Left: 4 processors (AP3000, T3E, RoadRunner ethernet, RoadRunner myrinet, SP2-Silver internode, SP2-Silver intranode, SP2-Thin2, NCSA, Muses); Right: 8 processors (AP3000, T3E, RoadRunner ethernet, RoadRunner myrinet, SP2-Silver, SP2-Thin2, NCSA).


3.3 Conclusions on the Kernel Level Tests

Although the T3E and SP2-P2SC machines are superior to the PC clusters, with the rapid improvement of PC CPUs the difference is likely to narrow quickly. The main issue is the inter-processor communications. Ethernet-based networks have low bandwidth and high latency compared to the supercomputers available, with a peak bandwidth nearly half that of most machines. Myrinet-based networks are close to supercomputer capabilities, although their one-way bandwidth is low for large message sizes.

4 Application Level Tests

Serial and parallel algorithms for unstructured grids are tested, based on a new class of hierarchical spectral/hp methods appropriate for tensor-product representations in hybrid subdomains, i.e. tetrahedra, hexahedra, prisms and pyramids (Karniadakis & Sherwin 1999). For the serial algorithms, two-dimensional simulations are used to evaluate the single node performance of the PCs. Parallel algorithms are tested through the use of two parallelisation techniques.

The unstructured spectral/hp element method (Sherwin & Karniadakis 1995, Warburton 1999) is a high-order scheme with the capability of efficiently discretising complex geometries. The fundamental concept, in two dimensions, consists of discretising the computational domain with triangular or quadrilateral cells, and approximating the solution within each cell using a series of polynomials (expansion basis), the order of which may be chosen arbitrarily. Convergence may, therefore, be achieved either by increasing the number of cells, or by increasing the order of the polynomials. The modal basis used within each cell is hierarchical, starting from linear, proceeding to quadratic, cubic, etc., as shown for a two-dimensional triangular and quadrilateral cell (or element) in Figure 9. For a Helmholtz or Poisson type equation, the Laplacian matrix is symmetric and banded, as indicated in Figure 10 for two dimensions.
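
As an illustration of such a hierarchical basis, its one-dimensional building block on the reference interval ξ ∈ [−1, 1] can be written as follows (a sketch of the modified modal basis in the spirit of Karniadakis & Sherwin (1999); the exact normalisation used in the code is not assumed here):

% Hierarchical (modified) modal basis of order P on xi in [-1,1]:
% two vertex modes plus P-1 interior ("bubble") modes built on the
% Jacobi polynomials P^{1,1}_{p-1}.
\phi_0(\xi) = \frac{1-\xi}{2}, \qquad
\phi_P(\xi) = \frac{1+\xi}{2}, \qquad
\phi_p(\xi) = \frac{1-\xi}{2}\,\frac{1+\xi}{2}\,P^{1,1}_{p-1}(\xi),
\quad 0 < p < P

The two-dimensional triangular and quadrilateral expansions of Figure 9 are then constructed from (possibly warped) tensor products of such one-dimensional modes.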

The Navier-Stokes equations,

    ∂V/∂t + (V · ∇)V = −∇p + ν∇²V,        ∇ · V = 0,        (1)

are integrated in time using a high-order splitting scheme (Karniadakis, Israeli & Orszag 1991). For the purposes of this paper, a second-order time-integration is used, summarised in the three main steps below (a sketch of the corresponding semi-discrete equations follows the list):

1. Advance the non-linear terms, (V · ∇)V. The velocity fields and their derivatives are evaluated at quadrature points for each element. The non-linear terms are explicitly advanced in time; this involves vector multiplications and additions.

2. Solve for the pressure, which translates into a Poisson-type equation. The Laplacian matrix has the form shown in Figure 10.

3. Advance the viscous terms; this translates into a Helmholtz-type equation. The Helmholtz matrix has the form shown in Figure 10.
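
For reference, one common semi-discrete form of this stiffly stable splitting (a sketch following Karniadakis, Israeli & Orszag (1991); Je denotes the order of the scheme and αq, βq, γ0 the corresponding integration weights, whose values are not reproduced here) is:

% Step 1: explicit advection using weighted previous time levels
\frac{\hat{\mathbf V} - \sum_{q=0}^{J_e-1} \alpha_q \mathbf V^{n-q}}{\Delta t}
  = -\sum_{q=0}^{J_e-1} \beta_q \bigl[(\mathbf V \cdot \nabla)\mathbf V\bigr]^{n-q}

% Step 2: pressure Poisson solve and projection
\nabla^2 p^{n+1} = \nabla \cdot \Bigl(\frac{\hat{\mathbf V}}{\Delta t}\Bigr),
\qquad
\frac{\hat{\hat{\mathbf V}} - \hat{\mathbf V}}{\Delta t} = -\nabla p^{n+1}

% Step 3: implicit viscous (Helmholtz) solve
\frac{\gamma_0 \mathbf V^{n+1} - \hat{\hat{\mathbf V}}}{\Delta t}
  = \nu \nabla^2 \mathbf V^{n+1}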

For three dimensions with non-periodic geometries or flows, tetrahedral, prismatic and hexahedral elements may be used (Karniadakis & Sherwin 1999). The parallelisation is based on a multi-level graph decomposition method (METIS) developed at the University of Minnesota (Karypis & Kumar 1995), which was extended to suit the specific characteristics of the spectral/hp method implemented in the NεκTαr code. An intrinsic element-based domain decomposition is employed. This geometry-based distribution of work among processors allows for a high degree of parallelism. To test this type of parallelisation, the fully three-dimensional NεκTαr-ALE version is utilised, based on the three-dimensional spectral/hp element discretisation and the arbitrary Lagrangian-Eulerian (ALE) formulation. This allows for moving three-dimensional geometries, with the extra computational cost associated with the ALE version split in two steps (Warburton 1999), as described below.


Figure 9: Ordering of the expansion coefficients for the modified expansion in a triangular and quadrilateral region with polynomial order 4. In this ordering we label the vertices first, followed by the edges, and finally the interior. For the interior modes the q index is allowed to run fastest.

Figure 10: Structure of the two-dimensional elemental (i.e. local) Laplacian matrix at polynomial order 4; (a) standard modal triangular expansion, (b) standard modal quadrilateral expansion. In both plots the boundary degrees of freedom were ordered first, followed by the interior degrees of freedom. The matrix is symmetric and we note the banded structure of the interior-interior matrix.

An added term in the non-linear step is associated with the updating of the positions of the vertices of each element, and an extra Helmholtz solve is associated with the calculation of the velocity of the moving mesh. A diagonally preconditioned conjugate gradient iterative solver is predominantly used in this type of simulation.

For three-dimensional simulations of geometries with at least one homogeneous direction, a straightforward mapping of Fourier modes to P processors is utilised, resulting in an efficient and balanced computation where the three-dimensional problem is decomposed into two-dimensional problems using multiple one-dimensional FFTs (mainly added to step 1). Each processor computes a certain number of Fourier modes. Typically, one processor is assigned to one Fourier mode, which corresponds to two spectral/hp element planes. Therefore, if four Fourier modes are used for resolving the homogeneous direction, then there are eight spectral/hp element planes. Increasing the resolution in the homogeneous direction, i.e. increasing the number of Fourier planes, can be accomplished without increasing the amount of memory required per processor if the number of processors is increased. The Poisson and Helmholtz-type equations are solved using direct solves, considering the banded and symmetric nature of the Laplacian matrices.

Figure 11: Meshes used. Left: Mesh used for the bluff body serial simulations and the NεκTαr-F simulations; Right: Geometry of the flapping wing simulation performed with NεκTαr-ALE 3D.


4.1 DNS/Serial algorithms

In order to evaluate the single-node performance of serial algorithms for DNS, two-dimensional simulations are performed. Strictly, a two-dimensional simulation may not be considered a DNS; however, because 2-D and 3-D serial simulations use similar algorithms, 2-D serial simulations are presented as sufficient for the objectives of this paper.

Both triangular and quadrilateral cells may be used, depending on the requirements of the simulation. For the bluff body flow simulation benchmarked, 230,000 degrees of freedom are used. Figure 11 (Left) shows the mesh and computational domain used, with 902 elements and a polynomial order of 8. Neumann boundary conditions (i.e. zero flux) were used at the outflow and on the sides of the domain, with the inflow being a laminar flow with a non-dimensionalised velocity of 1. Maximum CPU times per time step are shown in Table 1, calculated with the clock() call. Only direct solver methods are presented, and it is clear that only the P2SC nodes are faster than the PC, with the T3E being just as fast.

Table 1: CPU time for the serial algorithm bluff body simulation.

Machine                  CPU time (seconds) / time step
Fujitsu AP3000           1.22
Onyx 2                   1.03
Pentium II, 450MHz       0.81
SP2 "Thin2" nodes        1.44
SP2 "Silver" nodes       1.3
T3E                      0.82
P2SC                     0.71

Figure 12 shows internal timings within a representative time step. The vector lengths are approximately a quarter of the degrees of freedom, i.e. 15,000 long. The timings are split into 7 regions:

1. Transformation from modal (transformed) to quadrature (physical) space. It consists primarily of vector multiplications and additions.

2. The non-linear terms are evaluated in quadrature space for the current time step. It consists of vector multiplications and additions.

3. The non-linear terms just evaluated are weight-averaged with the non-linear terms of previous time steps. It consists of vector multiplications and additions.

4. Setup of the right hand side of the Poisson equation for the pressure. It consists of vector multiplications and additions.

5. Solution of the Laplacian for the Poisson equation. A direct solver (LAPACK), utilising the symmetric and banded nature of the matrix, is used.

6. Setup of the right hand side of the Helmholtz equation for the time-advancing of the viscous terms. It consists of vector multiplications and additions.

7. Solution of the Laplacian for the Helmholtz equation. A direct solver, utilising the symmetric and banded nature of the matrix, is used.

It is evident that the matrix inversions account for 60% of the total CPU time, with the setup of the right hand sides of the Poisson and Helmholtz equations taking another 20%. The difference between architectures is small, less than 1-2%.
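
For illustration, a minimal sketch of such a symmetric banded direct solve using LAPACK (via the LAPACKE C interface, applied here to a toy tridiagonal system rather than the actual elemental Laplacian) is:

/* Sketch of a symmetric positive-definite banded direct solve with LAPACK
 * (LAPACKE interface): Cholesky factorisation (dpbtrf) once, then repeated
 * back-substitutions (dpbtrs).  The matrix is a toy 1-D Laplacian, not the
 * actual spectral/hp elemental matrix.                                     */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 8, kd = 1, ldab = 2;   /* tridiagonal, upper band */
    double ab[2 * 8];                           /* (kd+1) x n banded store */
    double b[8];

    /* Column-major upper banded storage: A(i,j) -> ab[kd + i - j + j*ldab]. */
    for (lapack_int j = 0; j < n; ++j) {
        ab[kd + j * ldab] = 2.0;                 /* diagonal                */
        if (j > 0) ab[kd - 1 + j * ldab] = -1.0; /* superdiagonal           */
        b[j] = 1.0;                              /* right-hand side         */
    }

    lapack_int info = LAPACKE_dpbtrf(LAPACK_COL_MAJOR, 'U', n, kd, ab, ldab);
    if (info != 0) { printf("factorisation failed: %d\n", (int)info); return 1; }

    /* One back-substitution per right-hand side, reusing the factors. */
    info = LAPACKE_dpbtrs(LAPACK_COL_MAJOR, 'U', n, kd, 1, ab, ldab, b, n);
    if (info != 0) { printf("solve failed: %d\n", (int)info); return 1; }

    for (lapack_int i = 0; i < n; ++i) printf("x[%d] = %g\n", (int)i, b[i]);
    return 0;
}

The cost pattern matches the timings above: the factorisation is done once during setup, while the back-substitutions (steps 5 and 7) are repeated every time step.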


Figure 12: CPU time for the serial algorithm bluff body simulation: percentage of each stage (1-7) within a time step, for the SGI Onyx 2 and the 450MHz Pentium II.

4.2 DNS/Parallel algorithms

Parallel algorithms for unstructured grids are extensively used in DNS (Evangelinos 1999, Warburton 1999). Both parallelisation techniques, i.e. domain decomposition and Fourier mapping, are investigated by testing NεκTαr-ALE and NεκTαr-F respectively. CPU times are calculated using the clock() call, while wall-clock times are calculated using MPI_Wtime. The difference between the two types of timings indicates idle CPU time, which is associated with network inefficiency.
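
A minimal sketch of how such a pair of timings can be collected around a time step is given below; the step function is a hypothetical stand-in, not the actual NεκTαr interface:

/* Sketch of CPU vs wall-clock timing around one time step: clock() measures
 * CPU time consumed by the process, MPI_Wtime() measures elapsed wall-clock
 * time; their difference approximates idle time (e.g. waiting on the
 * network).                                                                */
#include <stdio.h>
#include <time.h>
#include <mpi.h>

/* Hypothetical stand-in for one solver time step. */
static void advance_one_step(void) { /* ... solver work ... */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    clock_t c0 = clock();
    double  w0 = MPI_Wtime();

    advance_one_step();

    double cpu_s  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall_s = MPI_Wtime() - w0;
    printf("CPU %.3f s, wall %.3f s, idle ~ %.3f s\n",
           cpu_s, wall_s, wall_s - cpu_s);

    MPI_Finalize();
    return 0;
}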

4.2.1 NεκTαr-F

NεκTαr-F is tested first, for a turbulent bluff body simulation with 461,000 degrees of freedom per processor. The same two-dimensional mesh (Figure 11, left) is used as in the serial simulations, with a spanwise length of 2π. A very important advantage of the spectral/hp-Fourier decomposition is its speed, as direct solvers may be employed for the solution of the 2D Helmholtz problems on each processor. Also, increasing the resolution in the direction discretised by Fourier series can be accomplished without increasing the memory required per processor, if the number of processors is increased accordingly. With the real and imaginary parts of a Fourier mode sharing the same matrices, memory savings may be achieved for a slightly larger communication overhead. High resolutions may thus be achieved at relatively low memory requirements per processor.

Table 2: Parallel NεκTαr-F: CPU/wall-clock time of the bluff body simulations for P processors.

P    AP3000      NCSA        SP2 "Silver"  SP2 "Thin2"  RoadRunner eth.  RoadRunner myr.  Muses
2    4.23/4.31   3.62/3.63   4.92/4.93     5.74/5.81    5.28/5.81        3.99/3.99        4.32/4.757
4    4.52/4.59   4.96/4.99   5.94/5.96     5.91/5.98    6.99/8.27        4.15/4.15        5.59/6.2
8    4.71/4.79   4.17/4.2    6.53/6.56     6.18/6.23    9.92/11.47       4.27/4.27        n/a
16   4.63/4.74   5.12/5.15   6.71/6.74     6.3/6.39     18.47/22.13      4.64/4.66        n/a
32   n/a         4.85/4.88   6.95/6.99     n/a          12.81/23.865     4.606/4.606      n/a
64   n/a         4.24/4.26   6.93/6.93     n/a          13.13/30.21      7.71/7.71        n/a
128  n/a         5.12/5.16   n/a           n/a          n/a              11.14/11.14      n/a

The main types of communication used in NεκTαr-F are:

• Global Exchanges (All-to-All), in step 2.

• Global Reduction operations,

  – Global Addition (AllReduce) for any possible outflow boundary conditions in step 4.

  – Global Addition, min, max for any runtime flow statistics.

• Gather, for possible tracking of flow variables during on-the-fly analysis of data.

• Sends (all but processor 0) and Receives (processor 0) for output of the solution field (if required).

Figure 13: NεκTαr-F: CPU and wall-clock time for the bluff body simulation: percentage of each stage within a time step, for the 4-processor simulation (NCSA and IBM SP2 "Silver" nodes).

Figure 14: NεκTαr-F: CPU and wall-clock time for the bluff body simulation: percentage of each stage within a time step, for the 4-processor simulation (RoadRunner ethernet and RoadRunner myrinet).

The most computationally intensive step in this code is step 2. Due to the parallelisation, step 2 is modified as:

• Global Exchange (All-to-All) of the velocity components.

• Nxy one-dimensional inverse FFTs for each velocity component, where Nxy is the number of points in one x-y plane divided by the number of processors.

• Computation of the non-linear step (step 2).

• Nxy one-dimensional FFTs for each non-linear component.

• Global Exchange of the non-linear components.

This type of algorithm relies heavily on the global exchange MPI_Alltoall. It is extensively used in any 2D or 3D FFT-based solver, since it supports the transposition of a distributed matrix. For example, any spectral distributed-memory homogeneous turbulence "box code" relies heavily on this type of communication. During MPI_Alltoall in NεκTαr-F, each processor communicates with all the others with message sizes of

    (Γ / P) × (Nz / P),


where P is the number of processors, Nz is double the number of Fourier modes used, and Γ is the number of degrees of freedom per velocity field (Evangelinos, Sherwin & Karniadakis 1999).
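
As a purely illustrative example (hypothetical numbers, not a case taken from the tables): with Γ = 10^6 degrees of freedom per velocity field, Nz = 32 Fourier planes and P = 8 processors, each processor exchanges messages of

% message size per destination processor (values, per velocity component)
\frac{\Gamma}{P} \times \frac{N_z}{P}
  = \frac{10^6}{8} \times \frac{32}{8}
  = 125{,}000 \times 4
  = 5 \times 10^5 \ \text{values}

with each of the other P − 1 processors in every global exchange, which is why the MPI_Alltoall bandwidth measured in Section 3.2 matters so much for this algorithm.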

Table 2 presents the CPU and wall clock times for the different machines for the same mesh, with the number of Fourier planes adjusted so that only 2 planes are allocated to each processor, i.e. the 2-processor run uses 4 Fourier planes (23,000 degrees of freedom), while the 4-processor run uses 8 Fourier planes (46,000 degrees of freedom). For the NCSA runs, the 2-16 processor simulations were performed using the 195MHz Origin 2000 processors, while the 32-128 processor simulations used the 250MHz processors, due to the queuing system at NCSA. As the number of processors increases, so does the number of degrees of freedom, thus the timings presented should remain constant. It is clear that for up to 4 processors, the ethernet-based PC networks provide acceptable performance, with the myrinet network being faster than the IBM SP2 nodes and the AP3000. The ethernet-based network seems to saturate above 8 processors, while the myrinet network saturates above 64 processors.

Internal timings shown in Figures 13-14 indicate that the main computational cost occurs at the non-linear step 2. The difference between the serial and parallel simulations is the addition of inter-processor communications between the distributed spectral/hp planes. MPI_Alltoall is used extensively in step 2, in order to perform the fast Fourier transforms required in this type of algorithm. This creates a bottleneck in communications, which is apparent in the PC clusters, where step 2 takes as much as 60% of the time.

Table 3: Parallel NεκTαr-ALE 3D: CPU/wall-clock time of the flapping wing simulations for P processors.

P    AP3000         NCSA         SP2 "Silver"  SP2 "Thin2"  RoadRunner myr.
16   43.23/43.674   25.71/25.79  29.59/29.71   65.47/69.21  25.38/25.4
32   n/a            9.87/10.08   15.82/15.85   n/a          13.57/13.58
64   n/a            6.97/6.99    9.37/9.4      n/a          9.83/9.87
128  n/a            5.72/6.04    n/a           n/a          n/a

4.2.2 NεκTαr-ALE

NεκTαr-ALE is a fully 3D high-order code, allowing both spatially fixed and varying geometries. The basic characteristics of this type of parallel spectral/hp code are described in Evangelinos, Sherwin & Karniadakis (1999). A local and a global mapping of the boundary information is required for an arbitrary domain partitioning, and a communication interface is introduced to treat global operations. Further discussion of the implementation of this algorithm and code may be found in Warburton (1999). In summary, a term is added in the non-linear step 2, associated with the updating of the positions of the vertices of each element; it consists of matrix/vector multiplications and additions. An extra Helmholtz solve is also added in step 7, associated with the calculation of the velocity of the moving mesh. Instead of direct solvers, a diagonally preconditioned conjugate gradient iterative solver is predominantly used in this type of simulation.
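
For reference, a minimal serial sketch of a diagonally (Jacobi) preconditioned conjugate gradient iteration is given below. It uses generic dense storage and a simple residual test purely for clarity; the actual NεκTαr solver applies the same iteration to the spectral/hp element operators in parallel:

/* Diagonally (Jacobi) preconditioned conjugate gradient, serial sketch.
 * Solves A x = b for symmetric positive-definite A, with M = diag(A) as
 * the preconditioner.  Dense row-major storage for clarity only.        */
#include <math.h>
#include <stdlib.h>

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

static void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < n; ++j) y[i] += A[i * n + j] * x[j];
    }
}

/* Returns the number of iterations performed. */
int pcg_jacobi(int n, const double *A, const double *b, double *x,
               int maxit, double tol) {
    double *r = calloc(n, sizeof *r), *z = calloc(n, sizeof *z);
    double *p = calloc(n, sizeof *p), *q = calloc(n, sizeof *q);

    matvec(n, A, x, q);                                      /* r = b - A x */
    for (int i = 0; i < n; ++i) r[i] = b[i] - q[i];
    for (int i = 0; i < n; ++i) z[i] = r[i] / A[i * n + i];  /* z = M^-1 r  */
    for (int i = 0; i < n; ++i) p[i] = z[i];
    double rz = dot(n, r, z);

    int it;
    for (it = 0; it < maxit && sqrt(dot(n, r, r)) > tol; ++it) {
        matvec(n, A, p, q);
        double alpha = rz / dot(n, p, q);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        for (int i = 0; i < n; ++i) z[i] = r[i] / A[i * n + i];
        double rz_new = dot(n, r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(q);
    return it;
}

In the parallel setting, the dot products become global reductions and the matrix-vector product requires exchanging values shared between neighbouring elements, which is exactly the communication handled by the interface described next.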

The communication interface used was designed by Tufo & Fischer (the "Gather-Scatter", GS, library; Tufo 1998). This interface, which is not specific to this code but may be widely applied, allows for the treatment of all the communications using a "binary-tree" algorithm, "pairwise" exchanges, or a mix of these two. Pairwise exchange is used for communicating values shared by only a few processors, while the "binary-tree" approach is used for values shared by many processors. The latter approach is essentially a global reduction operation on a subset of the total number of processors. It should be stressed that the MPI_Alltoall used in NεκTαr-F is not used in this approach.
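
To illustrate the binary-tree idea only conceptually (this is not the GS library interface, whose actual API is not reproduced here), a tree summation over the processes of a communicator can be written with plain MPI point-to-point calls:

/* Conceptual binary-tree sum-reduction to rank 0 (not the GS library API):
 * in round k, ranks with bit k set send their partial sum to the partner
 * at rank-2^k and drop out; the surviving ranks accumulate.  The loop
 * terminates after about log2(P) rounds.                                  */
#include <stdio.h>
#include <mpi.h>

static double tree_sum(double value, MPI_Comm comm) {
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);

    for (int step = 1; step < nproc; step <<= 1) {
        if (rank & step) {                   /* sender: pass partial sum up */
            MPI_Send(&value, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;
        } else if (rank + step < nproc) {    /* receiver: accumulate        */
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            value += incoming;
        }
    }
    return value;                            /* full sum is valid on rank 0 */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double total = tree_sum((double)rank, MPI_COMM_WORLD);
    if (rank == 0) printf("tree sum = %g\n", total);
    MPI_Finalize();
    return 0;
}

A subsequent broadcast (or a mirrored downward sweep) would turn this into the all-reduce needed when every processor requires the result.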

A flapping wing (NACA 4420) is simulated using 4,062,720 degrees of freedom, at a Reynolds number Re = 1000. The geometry of the wing is shown in Figure 11 (Right). Over a computational domain of 10 by 5 by 5, 15,870 elements are used, with a polynomial order of 4. Due to the computational cost associated with such a simulation, at least 16 processors have to be used. Table 3 shows the CPU and wall clock timings per timestep, calculated in a similar manner as for NεκTαr-F; as with Table 2, the 2-16 processor NCSA simulations were performed on the 195MHz Origin 2000 processors, while the 32-128 processor simulations used the 250MHz processors, due to the queuing system at NCSA. In this case, the number of degrees of freedom is fixed, thus the timings drop with increasing number of processors. The myrinet-based RoadRunner cluster is slightly slower than the SP2-Silver nodes at 64 processors, while for 32 processors it is faster (the 128-processor simulations were not feasible because the particular queue was unstable at the time, so accurate timings could not be obtained). For 16 processors, the PC cluster is faster than the rest of the machines presented. The AP3000 and the SP2-Thin2 show such poor performance due to marginal memory resources. Internal timings are shown in Figures 15 and 16, where a encompasses steps 1-4 and 6, b is step 5 and c is step 7. Since this algorithm does not use MPI_Alltoall, but Global Reduction/Gather operations (addition, min, max), global synchronisation, and pairwise and partial binary-tree reduction operations (GS library), the timings are distributed similarly to the serial simulations, with the weight on steps 5 and 7.

Figure 15: NεκTαr-ALE 3D: CPU and wall-clock time for the flapping wing simulation: percentage of each stage (a, b, c) within a time step, for the 16-processor simulation (NCSA and RoadRunner myrinet).

Figure 16: NεκTαr-ALE 3D: CPU and wall-clock time for the flapping wing simulation: percentage of each stage (a, b, c) within a time step, for the 64-processor simulation (NCSA and RoadRunner myrinet).

5 Discussion and Conclusions

The objective of this paper is to present an overall comprehensive evaluation of the applicability of PC/Linux clusters for DNS of turbulence. Low and high-end PC clusters were compared to existing supercomputers, both at the kernel and the application level, to evaluate the CPU and network capabilities. It may be concluded that:

• The single-processor kernel-level performance of the PC is not as good as that of the high-end supercomputers, such as the T3E or the IBM SP2-P2SC. It compares well, though, to the rest of the systems examined and may be considered a viable low-cost alternative.

• Ethernet-based networks are not competitive with supercomputer networks if latency and bandwidth are considered. Kernel level tests indicated poor performance compared to existing supercomputers.

• Myrinet-based networks are competitive with supercomputer networks at low to medium message sizes, according to the kernel level tests.

• Use of PCs for serial algorithms indicates superior performance of the PCs over most supercomputers, apart from the T3E and IBM SP2-P2SC.

• Parallel simulations using ethernet-based networks indicate inefficiency in communications above four processors. Internal timings indicate that the bottleneck is due to MPI_Alltoall.

• Parallel simulations using myrinet-based networks are competitive with supercomputers, yet are slightly slower for 64-processor simulations. At low processor counts (up to 4), ethernet-based networks are likely more cost-efficient.

PC clusters are developed based on the assumptions of the high cost of supercomputers and the rapid development of PCs. Direct speed and efficiency tests performed throughout this report indicate that PC clusters are less efficient than supercomputers, yet not by far. There is also a distinct difference between ethernet and myrinet-based PC clusters, which increases with increasing number of processors. At low processor counts, ethernet-based networks are slower, yet provide better cost-effectiveness than myrinet-based networks, which are cost-effective for simulations with a high number of processors.

Acknowledgments: This work was supported partially by DARPA, DOE, AFOSR, ONR and NSF. Computations were performed at the following computer centers: the Maui High Performance Computing Center (MHPCC), the National Center for Supercomputing Applications (NCSA), the Major Shared Resource Center of the Naval Oceanographic Office (NAVO), the AP3000 at Imperial College, the Center for Advanced Scientific Computation and Visualisation at Brown University, the IBM SP2 at the Center for Fluid Mechanics at Brown University, and the UNM-Alliance RoadRunner Supercluster at the Albuquerque High Performance Computing Center supported by the University of New Mexico/NCSA. The authors are grateful for the help of Professor Nobu Kasagi at the University of Tokyo. On RoadRunner, the work presented was considerably aided by Brian Baltz and Patricia Kovatch.

References

[1] E. Anderson et al. 1992, LAPACK Users' Guide, Society for Industrial and Applied Mathematics (SIAM), Philadelphia.

[2] D. Becker, T. Sterling, D. Savarese, J. Dorband, U. Ranawake and C. Parker 1995, BEOWULF: A parallel workstation for scientific computation, Proc. of the International Conference on Parallel Processing, Urbana-Champaign, Ill., Vol. I Architecture. http://cesdis.gsfc.nasa.gov/linux/beowulf/icpp95-.html

[3] C.H. Crawford, C. Evangelinos, D.J. Newman and G.E. Karniadakis 1996, Parallel benchmarks of turbulence in complex geometries, Computers and Fluids, 25: 7, pp. 677-698.

[4] J. Dongarra, J. DuCroz, S. Hammarling and R. Hanson 1988, An Extended Set of FORTRAN Basic Linear Algebra Subprograms, ACM, Vol. 14, No. 1, pp. 1-17.

[5] J. Dongarra, J. DuCroz, S. Hammarling and I. Duff 1990, A Set of Level 3 Basic Linear Algebra Subprograms, ACM, Vol. 16, No. 1, pp. 1-17.

[6] C. Evangelinos 1999, Parallel Simulations of Vortex-Induced Vibrations in Turbulent Flow: Linear and Non-Linear Models, Ph.D. Thesis, Division of Applied Mathematics, Brown University.

[7] C. Evangelinos, S.J. Sherwin and G.E. Karniadakis 1999, Parallel DNS algorithms on unstructured grids, in press at Computer Methods in Applied Mechanics and Engineering.

[8] G.E. Karniadakis and S.A. Orszag 1993, Nodes, modes and flow codes, Phys. Today 46, 34.

[9] G.E. Karniadakis, M. Israeli and S.A. Orszag 1991, High-order splitting methods for the incompressible Navier-Stokes equations, J. of Comp. Phys., 97: 414.

[10] G.E. Karniadakis and S.J. Sherwin 1999, Spectral/hp Element Methods for CFD, Oxford University Press.

[11] G. Karypis and V. Kumar 1995, METIS: Unstructured graph partitioning and sparse matrix ordering system, version 2.0, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455.

[12] R.M. Kirby, T.C. Warburton, S.J. Sherwin, A. Beskok and G.E. Karniadakis 1999, The Nektar Code: Dynamic Simulations without Remeshing, 2nd International Conference on Computational Technologies for Fluid/Thermal/Chemical Systems with Industrial Applications.

[13] LAM fix reference: John Dennis, private communication.

[14] LAM reference: http://www.mpi.nd.edu/lam/

[15] H. Lamb 1916, Hydrodynamics, ed. 4, Article 365, p. 651.

[16] E. Anderson et al. 1992, LAPACK Users' Guide, Society for Industrial and Applied Mathematics (SIAM), Philadelphia.

[17] C. Lawson, R. Hanson, D. Kincaid and F. Krogh 1979, Basic Linear Algebra Subprograms for Fortran Usage, ACM, Vol. 5, No. 3, pp. 308-323.

[18] Ma Xia, G-S Karamanos and G.E. Karniadakis 1999, Dynamics and Low-Dimensionality of the Turbulent Near-Wake, submitted to Journal of Fluid Mechanics.

[19] MPICH reference: http://www.mcs.anl.gov/mpi/mpich/

[20] MPI reference: http://www.mpi-forum.org

[21] M-VIA reference: http://www.nersc.gov/research/FTG/via/

[22] M-VIA latency reference: http://www.nersc.gov/research/FTG/via/faq.htmlq14

[23] M-VIA MPI reference: http://www.nersc.gov/research/FTG/via/faq.htmlq12

[24] R.D. Moser, J. Kim and N.N. Mansour 1999, Direct Numerical Simulation of turbulent channel flows up to Re_τ = 590, Phys. of Fluids, Vol. 11, No. 4.

[25] NetPIPE reference: http://www.scl.ameslab.gov/netpipe/

[26] S.A. Orszag and G.S. Patterson 1972, Numerical simulations of turbulence, Statistical Models and Turbulence, pp. 127-147.

[27] C. Reschke, T. Sterling, D. Ridge, D. Savarese, D. Becker and P. Merkey 1996, A Design Study of Alternative Network Topologies for the Beowulf Parallel Workstation, Proceedings, High Performance and Distributed Computing.

[28] D. Ridge, D. Becker, P. Merkey and T. Sterling 1997, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace.

[29] S.J. Sherwin and G.E. Karniadakis 1995, A Triangular Spectral Element Method; Applications to the Incompressible Navier-Stokes Equations, Comp. Meth. in Applied Mech. and Eng., 123: 189.

[30] H.M. Tufo 1998, Generalised communication algorithms for parallel FEM codes, Technical Report, Center for Fluid Mechanics, Brown University.

[31] VIA reference: http://www.viarch.org/

[32] T. Warburton 1999, Spectral/hp methods on polymorphic multi-domain algorithms and applications, Ph.D. Thesis, Division of Applied Mathematics, Brown University.