scalability of a pseudospectral dns turbulence code with ... · pdf filecall scft, scft and...

21
SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO Scalability of a pseudospectral DNS turbulence code with 2D domain decomposition on Power4+/Federation and Blue Gene systems D. Pekurovsky 1 , P.K.Yeung 2 ,D.Donzis 2 , S.Kumar 3 , W. Pfeiffer 1 , G. Chukkapalli 1 1 San Diego Supercomputer Center 2 Georgia Institute of Technology 3 IBM SP SciComp, Boulder CO, July 20, 2006

Upload: vuonglien

Post on 26-Mar-2018

227 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Scalability of a pseudospectralDNS turbulence code with 2D

domain decomposition onPower4+/Federation and Blue

Gene systemsD. Pekurovsky1, P.K.Yeung2,D.Donzis2,

S.Kumar3,W. Pfeiffer1, G. Chukkapalli1

1San Diego Supercomputer Center2Georgia Institute of Technology

3IBMSP SciComp, Boulder CO, July 20, 2006

Page 2: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Turbulence: examples

The small scales are important.

Computational Science of Turbulence – p.5/28

Page 3: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

DNS code

• Code written in Fortran 90 with MPI• Time evolution: Runge Kutta 2nd order• Spatial derivative calculation: pseudospectral method• Typically, FFTs are done in all 3 dimensions.• Consider 3D FFT as compute-intensive kernel

representative of performance characteristics of the fullcode

• Input is real, output is complex; or vice versa

Page 4: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

3D FFTUse ESSL library calls for 1D FFT on IBM, or FFTW on other systems

(FFTW is about 3 times slower than ESSL on IBM)

Forward 3D FFT in serial. Start with real array (Nx,Ny,Nz):• 1D FFT in x for all y and z

• Input is real (Nx,Ny,Nz).• Call SRCFT routine (real-to-complex), size Nx, stride is 1• Output is complex (Nx/2+1,Ny,Nz) – conjugate symmetry: F(k)=F*(N-k)• Pack data as (Nx/2,Ny,Nz) – since F(1) and F(Nx/2+1) are real numbers

• 1D FFT in y for all x and z• Input is complex (Nx/2,Ny,Nz)• Call SCFT (complex-to-complex), size Ny, stride Nx/2• Output is complex (Nx/2,Ny.Nz)

• 1D FFT in z for all x and y• Input and output are complex (Nx/2,Ny,Nz)• Call SCFT (complex-to-complex), size Nz, stride (NxNy)/2

Inverse 3D FFT: do the same in reverse order. Call SCFT, SCFT and SCRFT(complex-to-real).

Page 5: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

3D FFT cont’d

Note: Alternatively, could transpose array in memoryNote: Alternatively, could transpose array in memorybefore calling before calling FFTsFFTs, so that strides are always 1. In, so that strides are always 1. Inpractice, with ESSL this doesnpractice, with ESSL this doesn’’t give an advantaget give an advantage(ESSL efficient even with strides > 1)(ESSL efficient even with strides > 1)

Stride 1: 28% peak Flops on DatastarStride 32: 25% peakStride 2048: 10% peak

Page 6: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Parallel version

• Parallel 3D FFT: so-called transpose strategy, asopposed to direct strategy. That is, make sure all data indirection of 1D transform resides in one processor’smemory. Parallelize over orthogonal dimension(s).

• Data decomposition: N3 grid points over P processors• Originally 1D (slab) decomposition: divide one side of the cube over P,

assign N/P planes to each processor. Limitation: P <= N• Currently 2D (pencil) decomposition: divide side of the cube (N2) over

P, assign N2/P pencils (columns) to each processor.

Page 7: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Memory and compute power

• 2048^3 on 2048 processors – 230 MB/proc. This problem fits onDatastar and Blue Gene. Extensive simulations under way.

• 4096^3 on 2048 processors – 1840 MB/proc. This problem doesn’t fiton BG (256 MB/proc), and fits very tightly on Datastar.• Anyway, computational power of 2048 processors is not enough to solve

problems in reasonable time. Scaling to higher counts is necessary, certainlymore than 4096.

Therefore, using 2D decomposition is a necessity(P > N)

Page 8: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

1D Decomposition

Page 9: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

2D decomposition

Page 10: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

2D Decomposition cont’d

Page 11: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

CommunicationGlobal communication: traditionally, a serious

challenge for scaling applications to largenode counts.

• 1D decomposition: 1 all-to-all exchange involving Pprocessors

• 2D decomposition: 2 all-to-all exchanges within p1groups of p2 processors each (p1 x p2 = P)

• Which is better? Most of the time 1D wins. But again:it can’t be scaled beyond P=N.

Crucial parameter is bisection bandwidth

Page 12: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Alternative approaches attempted

• Overlap communication and computation• No advantage.

• Hybrid MPI/OpenMP• No advantage.

• Transpose in memory, call ESSL routines withstride 1• No advantage or worse

Page 13: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Platforms involved

• Datastar: IBM Power4+ 1.5 GHz• at SDSC, up to 2048 CPUs• 8 processors/node• Fat tree interconnect

• Blue Gene: IBM PowerPC 700 MHz• at SDSC, up to 2048 CPUs• at IBM’s T.J.Watson Lab in New York state, up to 32768

CPUs (2nd in Top500 list)• 2 processors/node• 3D torus interconnect

Page 14: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Performance on IBM Blue Gene and Datastar

64128

256

512

256

512

10242048

4096 8192

16384

32768

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

10 100 1000 10000 100000

Nproc

N^

3L

OG

2(N

)/(t

Np

)

CO 512^3

VN 512^3

CO 1024^3

VN 1024^3

CO 2048^3

VN 2048^3

CO 4096^3

VN 4096^3

Datastar 1024^3

Datastar 2048^3

VN: Two processors per nodeCO: One processor per node

Page 15: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

A closer look at performance on BGDNS 2048^3

20484096 8192

16384

32768

0

20000

40000

60000

80000

100000

120000

140000

0 5000 10000 15000 20000 25000 30000 35000

N proc

T x

P

VN total

CO total

VN comm

CO comm

Page 16: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Communication model for BG• 3D Torus network: 3D mesh with wrap-around

links. Each compute node has 6 links.• Modeling communication is a challenge for this

network and problem (mapping 2D processorgeometry to 3D network topology). Try to do areasonable estimate.

• Assume message sizes are large enough,consider only bandwidth, ignore overhead.

• Model CO (VN is similar)

Page 17: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Communication model for BG, cont’dTwo subcommunicators: S1 and S2

P = P1P2 = Px Py PzStep 1: All-to-all exchange among P1 processors within each group S1Step 2: All-to-all exchange among P2 processors within each group S2

By default tasks are assigned in a block fashion (although custommapping schemes also available to the user and are interestingoption)

• S1’s are rectangles in X-Y plane: P1= Px x (Py / k)• S2’s are k Z-columns: P2= k PzB=175MB/s, 1-link bidirectional bandwidth

Page 18: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Upper boundUpper bound

• Assume dimensions are independent• Find bottlenecks in each dimension, sum up the maximum time in

each dimension• Some links are idle some of the time

Take first step – communicator group is a plane Px X (Py/k)Assume torus (wraparound links) in x dimension, but not in y for k > 1Bisection bandwidth across y lines is b Px,Proceed in Px, stagesThe number of messages exchanged is Px, (Py/2k)2 for each stageTotal time for y-dimension bottleneck is ty = (Nb/B) Px, (Py/2k)2

Now independently consider X direction, and derive tx = (Nb/B) (Py/k) (Px /2 )2 (1/2)

Summing up, and using Nb=(4N3)/(P P1), we getT1 = (N3/P B) ((1/2)Px+(1/k)Py)

Page 19: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Upper bound cont’dNow step 2: k Z-lines in each communicator, all lie in Y-Z planesAgain, assume staged implementation. First communicate along y.Dimension size is k, bisection bandwidth is bPx but it’s shared among P1

groups. Do this Pz times.Time is Nb/b (k/2)2 P1Pz/Px (1/2)= (N3/Pb) Py/2Finally, exchange within Z-lines (k times)Tz = Nb/b (Pz/2)2 /2 (2 comes from torus links)So T2 = (N3/Pb) (Py+kPz/2)Summing up,

Tup = T1+T2 = (N3/P B) (Px/2 +(1/2+1/k)Py+Pz/2)For the lower bound, assume all links busy, and only the maximum time

counts. Obtain

Tlower = (N3/P B) [Max(Px/2,Py/k)+1/2 Max(Py,Pz)]

Page 20: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Communication model for BG

2048^3

1024

2048 4096

8192

16384

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

0 5000 10000 15000 20000

P

Tco

mm

x P

CO 2D measured

CO 2D predic upper bound

CO 2D predic lower bound

Page 21: Scalability of a pseudospectral DNS turbulence code with ... · PDF fileCall SCFT, SCFT and SCRFT (complex-to-real). SAN DIEGO SUPERCOMPUTER CENTER UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Summary• 2D decomposition enables significantly

increased scalability of the DNS turbulencecode

• Achieve good scaling on both IBM SP4 andBG/L (up to 32k processors)

• Ready for next generation of machines• DNS turbulence is one of the 3 Model Problems

in the recent NSF Petascale RFP