A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11


TRANSCRIPT

Page 1: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

A Strategy to Move to Exascale

Building on Experience on Roadrunner with Multifluid PPM

Paul Woodward, University of Minnesota

8/19/11

Page 2: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Let’s start small and work outward. This is a grid cell.

Page 3: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

We will subdivide it evenly into a grid briquette.

Page 4: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

We will subdivide it evenly into 64 cubical cells. The point of this operation is to achieve a minimum amount of grid, and hence of processing, uniformity. In our experience, this is an absolute requirement for high performance. Accelerators and also CPUs get their best performance by using a SIMD engine to do many calculations simultaneously. All the operands, on each cycle, must be perfectly aligned in packed data types of 4, 8, 16, or 32 words. Indirect addressing causes a major disruption of this highly efficient mode of operation. The briquette is a tiny, uniform domain.

Page 5: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

The highlighted “grid plane” will consist of either: 4 quadwords (Cell, Power7, Opteron, Nehalem), 2 octowords (Intel Sandy Bridge), or 1 hexadecaword (Intel MIC, Nvidia Fermi).
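These counts follow from dividing the 4 × 4 grid plane by the SIMD width, counted in 32-bit words (my restatement of the slide’s figures):

$$4 \times 4 = 16\ \text{words per plane}; \qquad 16/4 = 4\ \text{quadwords}, \quad 16/8 = 2\ \text{octowords}, \quad 16/16 = 1\ \text{hexadecaword}.$$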

Page 6: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

You can think of the grid plane as a vector, or on future devices we may have to think of the entire grid briquette as a vector. For Nvidia today, one must specify 64-wide operations, even though only 16 are actually done simultaneously.

Page 7: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

The idea is to build AMR out of several primitive operations, all of which can be implemented for briquettes at extremely high efficiency.

These include:
• Refine an entire briquette.
• Coarsen an entire set of 8 briquettes. (Sometimes just 4.)
• Use a boundary condition to generate a ghost briquette.
• Update an entire briquette.

All of these tasks are to be performed by a single thread. These tasks are done by a single SIMD engine running a single thread of control, but they are nevertheless highly parallel. This fact is underscored by Nvidia’s jargon that redefines the thread, so that a real thread is an Nvidia “thread block” (which must contain at least 64 Nvidia “threads”).

These tasks can only be truly efficient if many of them are composed into a fully pipelined code expression.

Page 8: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Pipelining for extreme efficiency by optimal utilization of limited main memory bandwidth.

A single “assignable” task for a single thread running on a single SIMD engine is:
• Fetch or construct ghost briquette #0.
• If the algorithm requires it, fetch or construct neighbor ghost briquettes in the 2 transverse dimensions.
• Prefetch the next briquette, possibly with transverse neighbors.
• Perform update work on the ghost briquette(s).
• Prefetch the next briquette(s) in the direction of this pass.
• If necessary, refine or coarsen transverse ghost briquettes for real briquette #1 to produce the transverse ghost cells required by the uniform-grid algorithm.
• Update this briquette (#1), then do the next prefetch.
• Update briquette #2, then write back new briquette #1, then do the next prefetch, . . .
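A minimal sketch of the revolving-buffer structure of such a pass is below. It is my own illustration, not the PPM code itself: true asynchronous prefetching is omitted, each briquette is simply brought into the free slot of a two-slot buffer before the update that consumes it, and the “update” is a placeholder average just so the skeleton compiles and runs.

c  Illustration only: a pipelined 1-D pass over a string of briquettes.
      program pipeline_sketch
      parameter (nsugar=4, nssq=nsugar*nsugar, nbq=8)
      dimension dd(nssq,nsugar,0:nbq), ddnu(nssq,nsugar,nbq)
      dimension d(nssq,nsugar,0:1)
c  Fill "main memory" with data; briquette 0 is the ghost briquette.
      do ibq = 0, nbq
        do i = 1, nsugar
          do jk = 1, nssq
            dd(jk,i,ibq) = real(jk + nssq*(i + nsugar*ibq))
          enddo
        enddo
      enddo
c  Fetch ghost briquette #0 into slot 0 of the revolving buffer.
      do i = 1, nsugar
        do jk = 1, nssq
          d(jk,i,0) = dd(jk,i,0)
        enddo
      enddo
      do ibq = 1, nbq
c  Bring briquette ibq into the free slot of the revolving buffer.
        inew = mod(ibq,2)
        iold = 1 - inew
        do i = 1, nsugar
          do jk = 1, nssq
            d(jk,i,inew) = dd(jk,i,ibq)
          enddo
        enddo
c  Update briquette ibq using its upwind neighbor, then write it back.
        do i = 1, nsugar
          do jk = 1, nssq
            ddnu(jk,i,ibq) = 0.5 * (d(jk,i,inew) + d(jk,i,iold))
          enddo
        enddo
      enddo
      print *, 'updated', nbq, 'briquettes; sample =', ddnu(1,1,1)
      end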

Page 9: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

The computation proceeds along a sequence of briquettes at same grid level.

In the on-chip cache workspace, we have many short segments of grid planes, each holding one variable and none longer than 5 planes.

These briquettes are in transit between main memory and the cache.

In the cache, we unpack arriving briquettes into our temporary segments, and we pack results into updated briquettes.

Page 10: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Only the “active” grid planes for each variable and intermediate result reside in the cache.

These occupy “revolving buffers” of grid planes, so that data is read into them from main memory, but for all temporary results, data is never written back.

For single-fluid PPM hydrodynamics, the necessary workspace for all these buffers, illustrated on the previous slide as segments of grid planes, is 31 KB.

For 2-fluid PPM+PPB hydrodynamics, 60 KB is needed. One must not forget that the instructions (the code) must also reside in the on-chip cache. This can take up to 100 KB, but only one copy is needed for all on-chip cores.

256 KB per core is now standard, and it is sufficient.

Page 11: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

The data traffic between main memory and the on-chip cache is clearly seen in the diagram.

This strategy completely eliminates reading and writing intermediate results over and over again from/to main memory.

Today’s GPUs do not have sufficient on-chip memory to permit this enormous computing cost savings.

They are therefore limited by their memory bandwidth. This is considerable, but it is not considerable enough. Consequently, GPUs will change if they are to become engines for exascale computation, rather than just special purpose devices for teenagers.

Intel’s MIC CPUs have just as many cores, same clock, same SIMD width (of 16), but 5 times Nvidia’s on-chip memory, on a per-SIMD-engine basis.

Page 12: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

The computation proceeds along a sequence of briquettes at same grid level.

On the latest devices, prefetching the data as shown doubles performance to 23% to 36% of peak.

These briquettes are in transit between main memory and the cache.

This means that we are already at the limit doing over 35 flops for each word read or written from/to main memory.

Page 13: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

The pipelining as shown requires a string of briquettes all at the same grid refinement level.

How can I then do AMR this way?

Imagine that we have a surface, as shown, at which we want to refine our grid.

Here we will illustrate using only a single additional grid refinement level.

Page 14: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

A “grid pencil,” or a strip of grid briquettes extending all the way through the domain volume, is shown. The 2 briquettes through which the surface passes are refined by one level to give 16 briquettes, in 4 strips of 4 each.

Updating each of the 4 refined grid pencils becomes a separate (independent) task for a single SIMD engine. The 2 remaining coarse grid pencils also become separate update tasks.

Page 15: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Note that in this example, even though we refined only 2 briquettes, all the refined pencil updates are efficient.

The cost to update a grid pencil of just one briquette is roughly double the cost of updating a single briquette that is a part of a long grid pencil.

The cost of the ghost cell processing and ghost briquette fetches is roughly that of updating one briquette.

Thus our 4 refined grid pencils of 4 briquettes each are 80% efficient.
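The 80% figure follows from the two cost estimates just given: a refined pencil of 4 briquettes does 4 briquettes of useful update work but pays roughly one extra briquette of ghost-cell processing, so

$$\text{efficiency} \approx \frac{4}{4+1} = 80\%.$$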

It is generally coarse grid pencils that might be inefficient, but there are “none” of these.

Page 16: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Another note on efficiency: I have refined all 64 cells in my briquette, but perhaps only half of these cells “really” need to be refined.

In that case, it is still more efficient to update the entire refined briquette in our super-fast fashion than to update the same small region with an optimal set of refined and coarse cells.

Remember, modern SIMD engines do 16 things at once, but only if they are the same thing, and only if they are right next to each other in a perfectly ordered sequence.

You can do fewer flops, but you can’t do them faster.

Page 17: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

How do I do a real problem, which must have an immensely more complicated grid?

I have a uniform reference grid at the coarsest refinement level.

I decompose the global domain into “grid bricks” of 32 coarsest cells on a side.

I allow only 3 grid levels (if you need more, my bricks will have only 128 of the finest cells on a side).

My bricks are cubes, and my cells are all cubes, but of course the domain need not be a cube.

The complexity of the grid in any individual grid brick is strictly limited, and I can handle it.

I (the framework, me, not you) identify all the grid pencils in my brick for this 1-D pass.

Page 18: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

My list of grid pencils serves as a “task list.”

Each grid brick will be updated by only a single node.

Thus a grid brick update is a task on a task list for nodes, and this list is administered by a “team leader” process, which is a separate MPI rank. (On Roadrunner, 4 ranks/node.)

With each grid brick, the list of pencils is a task list for threads at the node that are administered by a master thread (or perhaps jointly using locks).

All these tasks are executed in a self-scheduled fashion. When a thread is done with one task, it gets another. Or it may schedule one task ahead. Big pencils are updated first.
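A minimal sketch of what “self-scheduled” means at the thread level is below. This is my own illustration, not the framework: a shared counter hands out pencil tasks, each thread grabs the next one inside a critical section, and the placeholder “update” is just arithmetic so the example runs (serially too, if OpenMP is not enabled).

c  Illustration only: threads self-schedule pencil tasks off a counter.
      program self_sched
      parameter (npencil=16)
      dimension work(npencil)
      integer next
      next = 1
!$omp parallel private(itask)
      do
!$omp critical
        itask = next
        next = next + 1
!$omp end critical
        if (itask .gt. npencil) exit
c  "Update" pencil itask (placeholder arithmetic standing in for a
c  real pencil update).
        work(itask) = real(itask)**2
      enddo
!$omp end parallel
      print *, 'sum of all pencil results =', sum(work)
      end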

Page 19: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

How does all this relate to you and your code? Compatibility mode:

You presently have for each MPI rank only one thread. You presently have a long list of cells, sorted with ghost cells last, and a long list of sorted interfaces. I will create this same information for each grid brick and in the same format.

You can ignore my task list, all the briquette business, and all the fancy performance-critical stuff, if you wish. Or for whatever while seems reasonable.

The only issue will be calling MPI routines, or ghostget-type wrappers for MPI routines, during an update.

Such routines could in principle be emulated, but they will not be needed, and they are very costly.

It is simplest to break your update into several, with MPI messaging delivering ghost cell values in between.

Your module should therefore just run, if perhaps modified to consist of multiple update segments or modified to use the more complete ghost cell data I will provide (whether you need it or not).

Page 20: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

How does all this relate to you and your code? Exascale mode:

You make some modifications, discussed below and subject to negotiation, and then apply our precompiler to the modified code.

Your code is automatically pipelined. Your working data is automatically declared as the new SIMD-friendly, aligned, multi-word data types. Your vectorizable loops (which YOU designate as such) are automatically expressed in SIMD intrinsic functions.

Your briquette fetches are pipelined to become prefetches, using vendor-supplied intrinsic functions accessible only from the C language.

All your code is translated into C, you compile it. Done.

Page 21: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

So, what are these code modifications? You have to tell me how to update a single grid briquette. You really don’t have to tell me anything else.

If you have a module, whether you are the author or the author is dead but you can read it, then you know how to update a briquette.

All you have to do is express that in a fashion that the precompiler can understand.

What it can understand is negotiable, but if it must understand everything, it will take forever to write.

It DOES understand something, and it does that right now. So if you could get your code close to that, we would have much less work to do. (We would still have a lot of work to do, and would not remain idle.)

Page 22: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

What does the precompiler understand right now? I have had my 2 students write the precompiler to accept Fortran-W as input.

Fortran-W is Wilhelmson Fortran – my name for it – and it is the style of Fortran used by Bob Wilhelmson at the University of Illinois and NCSA.

I worked with his tornado codes in the early 2000s as a member of the NCSA “performance expedition.”

I showed Wilhelmson how to speed up his codes, and I personally translated his advection algorithm as an example, achieving a speed-up by a factor of 5.

The compiler people at Rice were supposed to automate this, but they never did. But they said they did. They wrote a paper about it, without crediting the source (they “already knew it”).

Page 23: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

So, what is Fortran-W? Fortran-W is your dream of a Fortran code expression. I have never written in Fortran-W before the students started working on the translator.

Fortran-W is too good to be fast. But when translated, it is very, very fast, because then it is Fortran-I, our intermediate form of Fortran.

Fortran-I is fast only when compiled with the old, 9.1 version of the Intel Fortran compiler. To be fast for a standard vendor compiler, Fortran-I must be translated into C+intrinsics. This translation is performed by our precompiler’s “back end.” Present targets: Cell, Power-7, Intel/AMD, MIC.

The C+intrinsics code beats every vendor Fortran compiler save Intel Fortran, version 9.1 (they’re now on 12).

Page 24: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W by Example, Parabola Construction 1:

c
      do i = 2-nbdy,nx+nbdy
      do k = 1,nz
!DEC$ VECTOR ALWAYS
c!DEC$ VECTOR ALIGNED
      do j = 1,ny
      dal(j,k,i) = a(j,k,i) - a(j,k,i-1)
      absdal(j,k,i) = abs(dal(j,k,i))
      enddo
      enddo
      enddo
c

This is just our first loop of a series. It is a triple loop nest over the entire, augmented domain, with ghost cells in the X-dimension. In earlier operations, we created the ghost cells using the boundary conditions. Here we are simply generating first differences and their absolute values. In later loops, we will need to know these values at different indices in the X-dimension. Therefore we create them here using only a single subtract.

The vector alignment directive is commented out, because no existing Fortran compiler will react to it properly. But we leave it in, because they might fix those compilers some day. They might.

Everything is properly aligned, because we created all these arrays on the stack. Given their indexing, with i last, and with jmax*kmax = 16, they will all line up perfectly. Because they are on the stack, they will be in cache if they fit. After translation, they will fit. But not before.

Page 25: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W by Example, Parabola Construction 2:

c
      do i = 3-nbdy,nx+nbdy-2
      do k = 1,nz
!DEC$ VECTOR ALWAYS
c!DEC$ VECTOR ALIGNED
      do j = 1,ny
      adiff = a(j,k,i+1) - a(j,k,i-1)
      azrdif = 3. * dal(j,k,i) - dal(j,k,i-1) - adiff
      azldif = dal(j,k,i+2) - 3. * dal(j,k,i+1) + adiff
      ferror = .5 * ( abs (azldif) + abs (azrdif) ) /
     1         (absdal(j,k,i) + absdal(j,k,i+1) + smalla)
      unsmooth = min (1., max (0., ferrfc * (ferror - crterr)))
      unsmth(j,k,i) = max (unsmooth, unsmth(j,k,i))
      daasppm = .5 * (dal(j,k,i) + dal(j,k,i+1))
      s(j,k,i) = 1.
      if (daasppm .lt. 0.) s(j,k,i) = -1.
      damax = 2. * min (s(j,k,i)*dal(j,k,i), s(j,k,i)*dal(j,k,i+1))
      damon = min (s(j,k,i)*daasppm, damax)
      damon = s(j,k,i) * max (damon, 0.)
      smooth = 1. - unsmth(j,k,i)
      damnot(j,k,i) = smooth * daasppm + unsmth(j,k,i) * damon
      enddo
      enddo
      enddo
c

Page 26: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W by Example, Parabola Construction 2a:

c
      do i = 3-nbdy,nx+nbdy-2
      do k = 1,nz
!DEC$ VECTOR ALWAYS
c!DEC$ VECTOR ALIGNED
      do j = 1,ny
      adiff = a(j,k,i+1) - a(j,k,i-1)
      ETC.
      enddo
      enddo
      enddo
c

Here, in our second loop nest, we have stepped in one grid cell from the right. This allows us to exploit the results of the first loop without doing redundant computation. Note that because we have put the X-dimension index last, we can perform this centered difference without losing the perfect alignment of our vectors. This is ideal for computation using a SIMD engine.

This alignment makes no difference on a classic Cray, but classic Crays no longer exist. On a classic Cray, we would also have so much main memory bandwidth that loops like this would be efficient. But classic Crays, like classic Coke, no longer exist.

On a modern machine, the performance of this loop would be terrible, unless the index extents were so small that everything would fit into our cache. Setting nx = ny = nz = 16 does that. But then performance would still be bad, but not as bad. I will show you. It is hard to believe that vendors cannot make stuff like this fly, but they can’t. It seems to be no one’s fault.

Page 27: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W by Example, Parabola Construction 2b:

Why isn’t Fortran-W, with the outer loops on i fused, good enough? There are two problems with this modified Fortran-W.

First, to do a 1-D pass in a different direction, we must transpose the data or access it at a stride. Modern computers just hate to do this, and they are very bad at it.

Crays did it just great. But, once again, there are no Crays anymore.

Second, in the most natural implementation of this, with i as the fast-running index and the inner loops done over i, we have unaligned vectors. These cannot be fast unless they are long. 32 elements is sort of OK, but 256 elements is definitely better. This tends to fill up our cache and thus to expel other useful data from it.

Doing things this way is OK. I wrote all my codes this way for decades, and I was pleased with them. But all my codes sped up by a factor of 3 when I went to the new format back in 2006. And I got another factor of 2 from the better messaging that is implied by the new format with its briquettes. So I’m sold. How about you?

Page 28: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W by Example, Parabola Construction: Fusion

      do i = 3-nbdy,nx+nbdy-2
      do k = 1,nz
!DEC$ VECTOR ALWAYS
c!DEC$ VECTOR ALIGNED
      do j = 1,ny
      adiff = a(j,k,i+1) - a(j,k,i-1)
      ETC.
      enddo
      enddo
      enddo

To make this code run fast, we must do several straightforward operations.

First, we must fuse all the outer loops on the index i. Because we cannot execute the second loop for the first value of i, we must insert a test in front of it that detects this bad value of the index. We will have, inside the outer loop on i, several inner loops over j and k, all with tests in front.

Next we fuse the j and k indices, turning the inner double loop nests into single loops, all of which are transparently vectorizable, but all of which will be preceded by a vector assertion.

Now we reset the parameters (and they must be parameters) nx, ny, nz all to 4. You might think this would be enough, but it is not.

Now we go through the entire body of the loop on i, changing all the i values to right-justify all the inner loops, so that the highest value of i in them will be nx+nbdy+3. Then we see for each variable which values of i, i-1, i-2, etc. are ever referenced. Then we reduce the i-dimension of each of these variables to accommodate only these values. Finally, we insert barrel shifts of indices for the resulting revolving buffers of grid planes.
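Below is a deliberately tiny sketch, written by me to illustrate these steps rather than to reproduce the translator’s actual output: one fused outer loop on i, fused jk inner loops with nx = ny = nz = 4, a guard in front of the shifted second computation, and a 2-plane revolving buffer for dal with a barrel-shifted slot index.

c  Illustration only: fused outer loop, guarded inner loops, and a
c  revolving 2-plane buffer for dal (slot i01 holds plane i, slot
c  1-i01 holds plane i-1).
      program fused_sketch
      parameter (nbdy=4, nx=4, ny=4, nz=4, nssq=ny*nz)
      dimension a(nssq,1-nbdy:nx+nbdy)
      dimension dal(nssq,0:1), absdal(nssq,0:1), daavg(nssq)
      do i = 1-nbdy, nx+nbdy
        do jk = 1, nssq
          a(jk,i) = real(jk) + 0.1*real(i)
        enddo
      enddo
      do i = 2-nbdy, nx+nbdy
c  Barrel shift of the revolving-buffer slot index.
        i01 = mod(i+nbdy,2)
        do jk = 1, nssq
          dal(jk,i01) = a(jk,i) - a(jk,i-1)
          absdal(jk,i01) = abs(dal(jk,i01))
        enddo
c  The second, right-justified loop may only run once both of its
c  operand planes exist in the buffer.
        if (i .ge. 3-nbdy) then
          do jk = 1, nssq
            daavg(jk) = 0.5 * (dal(jk,i01) + dal(jk,1-i01))
          enddo
        endif
      enddo
      print *, 'last daavg(1) =', daavg(1)
      end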

Page 29: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W by Example, Parabola Construction: Fusion 1

c
      parameter (nbdy=4)
      parameter (nsugar=nbdy)
      parameter (nssq=nsugar*nsugar)
c
c     dimension dd(nssq*nsugar*10,0:ncubes+1,0:ncubes+1,0:ncubes+1)
      dimension dd(nssq*nsugar*10,
     &             (ncubes+2)*(ncubes+2)*(ncubes+2))
c
c  There follows the local memory context of this routine.
c  It is less than 25 KB.
c
c     common / theDs / d(nssq,nsugar,10,0:1), dnu(nssq,nsugar,10,0:1)
      dimension d(nssq,nsugar,10,0:1), dnu(nssq,nsugar,10,0:1)
c
      dimension dcube1(nssq*nsugar*10)
c
      dimension rho(nssq,0:4),  rhonu(nssq)
      dimension p(nssq,0:4),    pnu(nssq)
      dimension ux(nssq,0:4),   uxnu(nssq)
      dimension uy(nssq,0:4),   uynu(nssq)
      dimension uz(nssq,0:4),   uznu(nssq)
      dimension fair(nssq,0:4), fairnu(nssq)
c

This is the declaration section for the Fortran-I expression of the whole hydro calculation.

We must use parameters, so that the compiler will see constants.

We copy the main memory data in DD into the cache-resident D on our stack.

We unpack D one plane at a time into the separate revolving buffers of 5 planes each: rho, p, ux, uy, uz, fair. Our final results are for one plane only, and these we transpose and place in DNU.

Page 30: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Fortran-W and Fortran-I:

The transformation from Fortran-W to Fortran-I is not the same as the transformations I explained to the compiler group at Rice (& to SGI-MIPS, & later HP) in 1998 for PPM and sPPM, nor is it the same as the transformations I explained to them for Wilhelmson’s tornado code in the early 2000s.

But this transformation is related, and is updated for use by SIMD engines working out of cache memories.

The result is 2 to 3 times faster than any other expression I have found.

It is roughly 6 to 10 times faster, averaged over a whole code, than Fortran-W.

But Fortran-W is a delight to write, debug, & maintain.

Page 31: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Why can’t or won’t vendors do these transformations?

The F-W to F-I transformation is global. It forces changes in your entire program. No compiler will ever do this for you.

How can I do this for you? Well, I can’t, but my students can.

Trick #1: You wrote your code according to Gittings’ Law. So your modules accept strings of cells. In any order. You can therefore accept any restructuring.

Trick #2: I write a new service layer. I get my colleague & students to implement a new Fortran array data type. This overcomes the last barrier.

Page 32: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

A New Fortran Array Data Type:

Let’s face it, in the machine everything is just bits. All a Fortran multi-D array is is a rule for associating a string of indices with an offset in bytes from a base. We can, if I provide a precompiler, make up any rule for this association that we want.

We already convert our Fortran to C, so we already linearize all our multi-D arrays. So it is really no extra trouble to linearize them according to a different rule.

How about this?

DD(j,k,i,ivar,iage)  ――→
DD(jj,kk,ii,ivar,ibq,jbq,kbq,iage)  ――→
DDfi(jj+4*((kk-1)+4*((ii-1)+4*((ivar-1)+nvars*((ibq-1)
     +nbqx*((jbq-1)+nbqy*((kbq-1)+nbqz*(iage-1)))))))

Page 33: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

A New Fortran Array Data Type:

This is really not so hard:

DD(j,k,i,ivar,iage)  ――→
DD(jj,kk,ii,ivar,ibq,jbq,kbq,iage)  ――→
DDfi(jj+4*((kk-1)+4*((ii-1)+4*((ivar-1)+nvars*((ibq-1)
     +nbqx*((jbq-1)+nbqy*((kbq-1)+nbqz*(iage-1)))))))

All the precompiler has to do is first insert:

jbq = 1 + (j − 1) / 4
jj = j − 4*(jbq − 1)

(and likewise for k and i), followed by DD(jj,kk,ii,ivar,ibq,jbq,kbq,iage) whenever it encounters DD(j,k,i,ivar,iage).

You might think this would lead to bad code, but this bad code is just the kind of thing that vendor compilers simply LOVE to fix. They are really good at that, although they can’t do much else except register allocation.
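To see concretely that this packed-briquette rule is just a column-major relabeling, here is a small check I wrote (the extents nvars, nbqx, nbqy, nbqz below are made-up sample values): an 8-D array view and a flat 1-D view are EQUIVALENCEd, one cell is marked through the 8-D view, and the slide’s formula finds it through the flat view.

c  Check (illustration only) that the packed-briquette index formula
c  is the ordinary column-major linearization of the 8-D view.
      program index_rule
      parameter (nvars=5, nbqx=3, nbqy=2, nbqz=2, nage=2)
      dimension dd8(4,4,4,nvars,nbqx,nbqy,nbqz,nage)
      dimension ddfi(4*4*4*nvars*nbqx*nbqy*nbqz*nage)
      equivalence (dd8, ddfi)
c  A sample global cell index (j,k,i), variable, and age level.
      j = 7
      k = 3
      i = 10
      ivar = 4
      iage = 2
c  Split each global index into briquette number and cell-in-briquette.
      jbq = 1 + (j-1)/4
      jj  = j - 4*(jbq-1)
      kbq = 1 + (k-1)/4
      kk  = k - 4*(kbq-1)
      ibq = 1 + (i-1)/4
      ii  = i - 4*(ibq-1)
c  Mark the cell through the 8-D view, then read it back through the
c  flat view using the formula from the slide.
      dd8(jj,kk,ii,ivar,ibq,jbq,kbq,iage) = 1.
      ifi = jj + 4*((kk-1) + 4*((ii-1) + 4*((ivar-1) + nvars*((ibq-1)
     &    + nbqx*((jbq-1) + nbqy*((kbq-1) + nbqz*(iage-1)))))))
      print *, 'formula index =', ifi, '  value found =', ddfi(ifi)
      end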

Page 34: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

A New Fortran Array Data Type:

Now that you got that, how about this?

      parameter (nsugar=4)
      parameter (nx=31*nsugar)
      parameter (ny=7*nsugar)
      parameter (nz=11*nsugar)
      dimension rho(nx,ny,nz,2), p(nx,ny,nz,2)
      dimension ux(nx,ny,nz,2), uy(nx,ny,nz,2), uz(nx,ny,nz,2)
cPPM$ packedbriquettearray DD(nsugar,nsugar,nsugar)
cPPM& dimension DD(nx,ny,nz,(rho:p:ux:uy:uz:fair),2)

Now, this is really no harder for us than the previous case, but it is immensely more useful for you.

What could it possibly mean? Consider this:

      dimension ei(nx,ny,nz)
      do k = 1,nz
      do j = 1,ny
      do i = 1,nx
      ei(i,j,k) = 1.5 * p(i,j,k,2) / rho(i,j,k,2)
      enddo
      enddo
      enddo

The translation of this loop is not trivial, but it is straightforward. It is given on the next slide.

Page 35: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

A New Fortran Array Data Type:

The translation into standard Fortran that you desire is:

      kb = -4
      do kbq = 1,11
      kb = kb + 4
      jb = -4
      do jbq = 1,7
      jb = jb + 4
      ib = -4
      do ibq = 1,31
      ib = ib + 4
      do kk = 1,4
      k = kb + kk
      do jj = 1,4
      j = jb + jj
      do ii = 1,4
      i = ib + ii
      ei(i,j,k) = 1.5
     & * DDfi(ii+4*((jj-1)+4*((kk-1)+4*(1+6*((ibq-1)+31*((jbq-1)+7*((kbq-1)+11))))))
     & / DDfi(ii+4*((jj-1)+4*((kk-1)+4*(6*((ibq-1)+31*((jbq-1)+7*((kbq-1)+11))))))
      enddo
      [5 more enddos]

The vendor compiler will just eat this up. It will be so grateful that you gave it gobs of common subexpressions to remove and constants to be lifted out of loops. It will just love you for this.

Page 36: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Why no vendor will ever give you this:

All of this looks perfectly simple, until you ask me if you can do this:

      call whatever(rho(13,27,3,2),result)

This would simply be unreasonable of you. Just imagine what hoops I would have to go through to give you this. Why, you could do just about anything inside whatever. I won’t allow you to do this. Not on your life! Not for what you pay me. Not even for twice that. Well, . . .

This exposes a critical issue with precompilation. It is not just what we do for you. You must also do something for us. We need to negotiate a set of rules that limit what expressions you write. These rules will make the construction of the precompiler possible, at least by my students. And in a time less than the age of the universe.

Page 37: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

What does a Vendor Give you Right Now?

We take a Fortran-W expression of single-fluid PPM: this reads 6 variables per grid cell and writes 6, after doing 787 flops. This is 66 flops/word. Not as good as LinPack, but then we actually need to do these flops for a practical result.

If we go to a more robust version of PPM, which handles shocks above Mach 2, we then need to read 20 more words, while not doing many more flops (about 130 more). The computational intensity is then only 29 flops/word.

This will not run as well (it is slower by about a third) no matter how it is expressed, because no existing CPU, to my knowledge, is able to really overlap accesses to main memory with computation. This is something the industry could fix, if you ask them, and put $ behind it.
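The two intensity figures quoted above are just the stated flop and word counts divided out:

$$\frac{787\ \text{flops}}{6+6\ \text{words}} \approx 66\ \text{flops/word}, \qquad \frac{787+130\ \text{flops}}{12+20\ \text{words}} = \frac{917}{32} \approx 29\ \text{flops/word}.$$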

We will run a shear instability problem on a uniform grid of 128³ cells. We will run this for 400 time steps, dumping data for 3-D visualization every 40 time steps. We will include the cost of dumping this data in our performance numbers.

Fortran-W: 8 cores each update 1/8 of the domain in parallel via OpenMP. First-touch initialization makes this more efficient than 8 MPI ranks. Intel Fortran version 12, the very latest, gives 0.796 Gflop/s/core.

Fortran-I: 16 threads update the entire domain, divided into 8 bricks. All 8 cores cooperate on each brick. Intel Fortran version 12 gives 3.66 Gflop/s/core. Intel Fortran version 9.1 gives 7.88 Gflop/s/core. If we do not include the cost of the output, this performance rises to 8.33 Gflop/s/core. A factor of 10!

Page 38: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Why is Fortran-W Performance so Poor?

Consider the Fortran-W expression of single-fluid PPM:

      adiff = a(j,k,i+1) - a(j,k,i-1)
      azrdif = 3. * dal(j,k,i) - dal(j,k,i-1) - adiff
      azldif = dal(j,k,i+2) - 3. * dal(j,k,i+1) + adiff
      ferror = .5 * ( abs (azldif) + abs (azrdif) ) /
     1         (absdal(j,k,i) + absdal(j,k,i+1) + smalla)
      unsmooth = min (1., max (0., ferrfc * (ferror - crterr)))
      unsmth(j,k,i) = max (unsmooth, unsmth(j,k,i))
      daasppm = .5 * (dal(j,k,i) + dal(j,k,i+1))
      s(j,k,i) = 1.
      if (daasppm .lt. 0.) s(j,k,i) = -1.
      damax = 2. * min (s(j,k,i)*dal(j,k,i), s(j,k,i)*dal(j,k,i+1))
      damon = min (s(j,k,i)*daasppm, damax)
      damon = s(j,k,i) * max (damon, 0.)
      smooth = 1. - unsmth(j,k,i)
      damnot(j,k,i) = smooth * daasppm + unsmth(j,k,i) * damon

This is that loop body from the PPM interpolation routine. It is typical. It reads in a, dal, unsmth, and it writes out unsmth, s, damnot. So it has half the inputs and half the outputs of the entire PPM algorithm, but it performs only (count them) 16 adds, 13 multiplies, and 1 reciprocal.

Counting the reciprocal as 3 flops (the classic Cray standard), we get only 5.3 flops/word. This intensity is more than 10 times less than PPM’s. No matter how hard the compiler works on this, this loop by itself is simply hopeless. To get speed, we MUST combine this calculation with others that use the same data. This is the key to our transformed code’s performance, but short, perfectly aligned operands also contribute.
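For the record, the 5.3 figure comes from the slide’s own counts:

$$16\ \text{adds} + 13\ \text{multiplies} + 3\ (\text{reciprocal}) = 32\ \text{flops}, \qquad \frac{32\ \text{flops}}{3\ \text{reads} + 3\ \text{writes}} \approx 5.3\ \text{flops/word}.$$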

Page 39: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

What about GPUs?

Consider the Fortran-W expression of single-fluid PPM, the same interpolation loop body shown on the previous slide. It is typical.

This is just the sort of “streaming” code expression Nvidia’s advertising claims the GPU was built for. It could be immediately transliterated into CUDA. At 5.3 flops/word, it would be completely limited by the memory bandwidth.

The Nvidia Fermi card has 5 times the memory bandwidth of a single CPU, and it has 7 times the number of SIMD engines as a quadcore Intel Nehalem CPU (4 times wider, with 3 times slower clocks).

Most of those cores are wasted on this loop, but the memory bandwidth could get the performance up to 75 GB/sec, about the most we have ever seen on the Fermi card. This is a little under 20 Gwords/sec, which for our loop (assuming flops are free) corresponds to 106 Gflop/s, equal to 13 Intel Nehalem cores (or 2 Westmere CPUs) running our transformed code.
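Using the slide’s round numbers, and assuming 32-bit words, the bandwidth-limited estimate works out as:

$$\frac{75\ \text{GB/s}}{4\ \text{B/word}} \approx 19\ \text{Gwords/s}\ (\text{a little under }20), \qquad 20\ \text{Gwords/s} \times 5.3\ \text{flops/word} \approx 106\ \text{Gflop/s} \approx 13 \times 8\ \text{Gflop/s cores}.$$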

Page 40: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

What’s so Great about GPUs?

GPUs and CPUs are converging as we speak. At the moment:

The Nvidia Fermi card has 5 times the memory bandwidth of a single CPU, and it has 7 times the number of SIMD engines as a quadcore Intel Nehalem CPU (4 times wider, with 3 times slower clocks).

An Intel Westmere CPU has 6 “Nehalem-style” cores, and the server chip has 10 of these cores. These chips first appeared about the same time as Nvidia’s Fermi, and they are much less expensive.

A dual-CPU Intel Westmere PC workstation (we have 6 in our lab) costs about $6000, about the same as one Fermi card plus a host with adequate memory, networking, etc. So this is an apples-to-apples comparison.

Assuming that the Fermi card could run single-fluid PPM at 106 Gflop/s, which we know it cannot (we have never seen it go above 89 Gflop/s on any part of PPM), this would need to be compared to 12 Westmere cores running single-fluid PPM at about 8 Gflop/s/core, for a total of about 96 Gflop/s.

Thus the GPU-accelerated workstation and the normal workstation are just about neck and neck in cost and performance for hydrodynamics.

The GPU has a memory bandwidth advantage and is over-provisioned with computing capability that it simply cannot use.

The CPU has an on-chip memory advantage which completely offsets Fermi’s bandwidth capability – for hydrodynamics – at the same price.

Page 41: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Conclusion?

If main memory bandwidth were your irreducible problem, then the GPU is for you. But, with our code precompiler, we eliminate this problem and replace it once more with a computing capability problem. So the CPU is for you.

But you could go either way, and why not? The Fortran-W expression is a natural for translation into CUDA. It can also be translated into a fully pipelined, fully aligned Fortran-I expression for a SIMD engine with a cache. We are building both translation capabilities, so you will have your bets completely hedged.

Whether GPUs, as distinct devices, are in HPC to stay is not clear. Nvidia is putting more memory onto their chips. Intel is making their SIMD engines much wider (8-wide today with Sandy Bridge, and 16-wide next year with MIC). This looks like convergence to me. How does it look to you?

Page 42: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

How do I Interpret the Hype?

GPU advocates will be glad to tell you their speed-ups, but they are much more reluctant to tell you their speeds.

In comparison to a single core of a single CPU running a code expressed in Fortran-W, it is no surprise to achieve, say, a speed-up of 22×, a CFD speed-up recently claimed by Nvidia. After all, the Fermi card has 28 cores, so 22× is not astounding.

On both devices, such a code is bound to be memory bandwidth limited. This is inevitable on a GPU but NOT on a CPU. For example, single-fluid PPM, expressed in Fortran-W, runs at 0.79 Gflop/s/core on an Intel Nehalem CPU, exploiting its vector processing capability but not exploiting its cache. 22 × 0.79 = 17.4 Gflop/s. Voila!

For our multifluid PPM code, we recently measured 14.7 Gflop/s on Nvidia’s Fermi card. Not so different. So this checks out.

But our code transformations give an acceleration on the CPU of 10× over Fortran-W. Then a Nehalem CPU runs at 32 Gflop/s. In Nvidia parlance, this is 40× (4 cores and 10× per core).

Page 43: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

What if someone gives you a GPU cluster? Well, would you refuse a CPU cluster? Then, accept.

Express your code in Fortran-W, which is ideal for a GPU. Obey the negotiated rules of our code precompiler. Let it generate CUDA for you. But also let it generate extremely fast CPU code too, and run that on everything else. You are completely covered. You can have it both ways!

The point here is that good code for a GPU is bad code for a CPU, and vice-versa. So, if you just code for one, you lose on the other. OpenCL does not fix that. It just lets you run well on one and badly on the other.

Our precompiler solves this dilemma, by delivering good code on both devices via precompilation from a very special source.

Page 44: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Now let’s get beyond a single core and a single node:

We will use domain decomposition. For now, we will work with just 3 grid levels. In future, it is easy to add more coarse levels, but we will keep the maximum possible size of a fully refined grid brick at 128³.

On the coarsest level, then, we have bricks of 32³ cells. On average, we would expect this to produce, with some refined subregions, about 64³ grid cells. A CPU with 8 cores can update a brick of this size very efficiently, using our tricks.

The largest this could ever be is 128³ cells. This is very efficient for 8 cores to update. It is also relatively small, so we can have 8 of these easily residing at a node.

We will use a single time step for all grid levels. Essentially all the grid is at the finest level in most problems, so this has no real cost in a practical calculation.

Thus we will have grid bricks that range in update cost by a factor of 8 up or down from the average. This we can handle.
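The factor of 8 quoted here is just the ratio of brick volumes at adjacent extremes, using the slide’s sizes:

$$\left(\tfrac{128}{64}\right)^3 = 8\ \text{(largest vs. average)}, \qquad \left(\tfrac{64}{32}\right)^3 = 8\ \text{(average vs. smallest)}.$$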

Page 45: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

First, let’s think about a single grid brick:

We will provide enough ghost cell data to enable a full update of the grid brick without any further messaging.

My plan is to augment the grid brick with ghost data by one ghost grid briquette in the 2 dimensions transverse to this 1-D pass and by 2 ghost briquettes in the dimension along this 1-D pass.

These slab-shaped augmented regions will thus permit us to produce:
• 2 cells at my refinement level in each transverse direction.
• 4 cells at my refinement level in the direction of the pass.

This is just enough for the most elaborate, multifluid PPM now. This should be more than enough for most algorithms. If more is needed, one can implement that algorithm as multiple update tasks, which will be managed by the run-time system.

This ghost data will be provided by the messaging strategy shown on the next set of slides.

Page 46: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Each grid brick is augmented in X by a face message.

Page 47: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Each grid brick is augmented in Y by a face message that is itself augmented in X.

Page 48: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Each grid brick is augmented in Z by a face message that is itself augmented in both X and Y.

X-face messages are sent first, used on receipt to augment Y-face messages; then these are sent, and used on receipt to augment Z-face messages. In this way, the messaging topology for each brick is kept relatively simple. All these messages may contain refined regions; only the coarsest grid level is shown here for simplicity. Messages are one coarse briquette thick.
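To make the ordering concrete, here is a small 2-D sketch of the same trick, written by me as an illustration rather than taken from the talk: faces are exchanged in X first and then in Y, and because the Y-face messages carry the freshly filled X ghost cells along with them, the corner ghosts arrive without any diagonal messages. The 3-D case described above simply adds a Z stage, and its messages are whole briquette slabs rather than single cells.

c  Illustration only: X-then-Y ghost exchange on a 2-D process grid
c  with a 1-cell ghost layer around an nx by ny local patch.
      program face_messages
      include 'mpif.h'
      parameter (nx=4, ny=4)
      dimension u(0:nx+1,0:ny+1)
      dimension sbuf(ny), rbuf(ny)
      integer status(MPI_STATUS_SIZE)
      integer dims(2), comm2d, left, right, down, up
      logical periods(2)
      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      dims(1) = 0
      dims(2) = 0
      call MPI_Dims_create(nprocs, 2, dims, ierr)
      periods(1) = .true.
      periods(2) = .true.
      call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
     &                     .true., comm2d, ierr)
      call MPI_Comm_rank(comm2d, myrank, ierr)
      call MPI_Cart_shift(comm2d, 0, 1, left, right, ierr)
      call MPI_Cart_shift(comm2d, 1, 1, down, up, ierr)
      do j = 0, ny+1
        do i = 0, nx+1
          u(i,j) = real(myrank)
        enddo
      enddo
c  Stage 1: exchange X faces (strided in Fortran, so copy to buffers).
      do j = 1, ny
        sbuf(j) = u(nx,j)
      enddo
      call MPI_Sendrecv(sbuf, ny, MPI_REAL, right, 1,
     &                  rbuf, ny, MPI_REAL, left,  1,
     &                  comm2d, status, ierr)
      do j = 1, ny
        u(0,j) = rbuf(j)
      enddo
      do j = 1, ny
        sbuf(j) = u(1,j)
      enddo
      call MPI_Sendrecv(sbuf, ny, MPI_REAL, left,  2,
     &                  rbuf, ny, MPI_REAL, right, 2,
     &                  comm2d, status, ierr)
      do j = 1, ny
        u(nx+1,j) = rbuf(j)
      enddo
c  Stage 2: exchange Y faces, sending whole rows 0:nx+1 so that the
c  X ghosts just received ride along and fill the corner ghosts.
      call MPI_Sendrecv(u(0,ny), nx+2, MPI_REAL, up,   3,
     &                  u(0,0),  nx+2, MPI_REAL, down, 3,
     &                  comm2d, status, ierr)
      call MPI_Sendrecv(u(0,1),    nx+2, MPI_REAL, down, 4,
     &                  u(0,ny+1), nx+2, MPI_REAL, up,   4,
     &                  comm2d, status, ierr)
      if (myrank .eq. 0) print *, 'corner ghost u(0,0) =', u(0,0)
      call MPI_Finalize(ierr)
      end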

Page 49: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

How can I implement these 3 episodes of messaging?

If I placed only a single grid brick at each node, I would have a problem. Then I would have to:
1. send off my X-face messages,
2. wait for my X-neighbors’ X-faces to arrive,
3. augment my Y-face messages and send them off,
4. wait for my Y-neighbors’ augmented Y-faces to arrive,
5. augment my Z-face messages and send them off,
6. wait for my Z-neighbors’ augmented Z-faces to arrive, and
7. get on with my life as a brick.

I could do this, but this would be a lot of waiting. If, however, I have more bricks to update, I can be doing that while I wait on all these messages.

If I have 8 grid bricks to update, chances are that I will never be at a loss for some real work to do while all the messages are in flight. On a uniform grid, I can arrange this ideally & never wait, by having my 8 bricks fit together to make up a big brick.

Page 50: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

This looks fine for a Uniform Grid, what about AMR?

We would like our AMR code to run nearly as fast as a uniform grid code in the rare event that the grid actually is uniform. If we adopt a near-optimal messaging strategy for the uniform case, and extend it to the AMR case, we have a chance at this.

This messaging is relatively simple. Each brick only needs to know the locations of its 6 face neighbors. It talks to 26 neighbors, but reaches 20 of them only indirectly. This involves a cost in timing and synchronization, but all messages are large, and hence efficient. Each brick only knows what it needs to know, which is not much.

The key to this messaging simplicity is geometry. To accomplish this simplification, I must update only bricks that have perfectly smooth faces and that are rectangular solids. If I refine only entire grid briquettes, I can satisfy this constraint. This constraint is absolutely not limiting in its power to describe a complex flow with included material boundaries and shocks.

Page 51: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Load Balancing?

You have probably guessed that the down side of simple messaging is load balancing. My brick could range from 1024 cells to over 2 million cells. This is a big range. It would take 2000 little bricks to equal one big one.

Solution: Each node must, on average, update lots of grid bricks. We will say 8, on average, for a start.

We will do the big bricks first, but even so, some nodes may finish early. That’s AMR for you. Work is reduced, so you finish early.

But our early finishers will not go to the beach, as it were. We will have them go right on with the next round of updates. Nothing, fundamentally, prevents this. They cannot get infinitely far ahead, of course, but if they see this condition coming, they will ask for more work to do.

Page 52: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Task Lists:

The key concept in dynamic load balancing is the task list. To balance my loads dynamically, I have to know what is happening. Not only what work there is, but who is doing it and whether or not they are good at that (as measured by how long it is taking).

We will assume that all work is done perfectly, or not at all. The task list is a list of things to do, with qualifiers for each task. We will assume that our list of tasks is a revolving list, because we will just do this same stuff over and over and over again. We can therefore use our experience with these tasks to make intelligent task assignments.

Our task list will be passive. It is just a list, not a person. Any worker can access and modify this list, after first locking access to it, of course. This would never work in your research group, but here every worker is identical, and all share the common goals – and all are running the identical task management program.

Page 53: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Task Attributes:
1. Last round for which this task was completed.
2. Time it took on the last round. Negative if done on another node.
3. Who did the task on the last round. Worker number or −node.
4. Node that supplied the data for this task on the last round.
5. Worker to which this task is presently assigned; −1 if none.
6. Time when this task was assigned; −1. for not assigned.
7. Size of this task – for example, number of cells to update.
8. Identities of the nodes holding the data for the 6 face neighbors.
9. Ordinal number of this task – for example, coordinates within an octobrick consisting of 8 adjoining bricks.
10. X-message arrival status.
11. Y-message arrival status.
12. Z-message arrival status.
13. Task readiness – 1 for ready, 0 otherwise.

The task list is an ordered list. It is recommended that the tasks be launched in this order, although that is not a constraint.
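As a concrete (and entirely hypothetical) rendering of this record, the attributes above could be collected into a Fortran derived type along the following lines; the field names are my own and are not taken from the talk.

c  Sketch of one task-list entry as a derived type (names are mine).
      module task_list_mod
      type task_entry
      integer :: last_round    ! 1. last round completed
      real    :: last_time     ! 2. time on last round; <0 if off-node
      integer :: last_worker   ! 3. worker number, or -node
      integer :: data_node     ! 4. node that supplied the data
      integer :: worker        ! 5. assigned worker; -1 if none
      real    :: assign_time   ! 6. time assigned; -1. if unassigned
      integer :: nsize         ! 7. size, e.g. number of cells
      integer :: nbr_node(6)   ! 8. nodes of the 6 face neighbors
      integer :: iord(3)       ! 9. coordinates within the octobrick
      integer :: msg_x         ! 10. X-message arrival status
      integer :: msg_y         ! 11. Y-message arrival status
      integer :: msg_z         ! 12. Z-message arrival status
      integer :: ready         ! 13. 1 for ready, 0 otherwise
      end type task_entry
      end module task_list_mod
c
      program task_list_demo
      use task_list_mod
      type(task_entry) :: tasks(8)
      tasks(:)%ready = 0
      print *, 'task list length =', size(tasks)
      end program task_list_demo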

Page 54: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Task Interdependence:

The readiness of a given task can depend upon the completion of other tasks on the list. Message dispatch and arrival is a simple example.

The update and X-face message dispatch for a brick is one task. Part of this task is to mark the X-face message receipt task as “ready.”

The X-face message receipt and Y-face message augmentation and dispatch is another task. It marks the Y-messaging task as “ready.”

The Y-face message receipt and Z-face message augmentation and dispatch is another task. Part of this task must be to mark the next update task for this brick as “ready.”

It is clear that the messaging tasks are short, but time-critical. So we should choose to execute them, despite the general plan of doing big update tasks before small ones. If we take up these tasks immediately as they become ready, then we will wait a lot. Another strategy is to demand that he who updates also messages, but also to demand that he take on some other work to cover all the time otherwise spent waiting.

Page 55: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Task Assignment Rules:

Some rules can make intelligent task assignments easier. Here we discuss the task list at a network node, which is assumed to have a shared memory and multiple workers, each of which is a separate, multi-threaded MPI rank.

• The owner of a grid update task also owns the 3 subsequent messaging tasks for that same brick, after which the next update task for the brick is marked as ready.

• Each worker, upon inspecting the task list while it is locked from others, has the responsibility to reorder it based upon the information it contains, so that the first task on the list is the one most favored for adoption by the next worker.

• At the beginning of a new 1-D pass, the first task must be to fetch to this node a copy of an additional grid brick from elsewhere, so that its data can be available if needed at the end of the pass, when this node may run out of useful work to do.

• Forced idleness must be reported to the team leader process that is coordinating work for this team of nodes.

Page 56: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Task Assignment Order:

Rules for task assignment order cannot be perfect, but they can be helpful in avoiding idleness.

• We want to get the big tasks over with first, so that we can fill in the remaining time with little tasks.

• The bigness of a task is a combination of its official “size” and how long it took to perform on the last round.

• We want to do brick updates for which neighbors exist off of this node before ones that have all neighbors on this node, if all else is equal.

• An X-neighbor off-node creates more priority than a Y- or Z-neighbor off-node.

• There is an optimal update order for octobrick octants. We could consider following this pattern even if the locations of these octants get all mixed up.

• The best task ordering rules are a subject of research. The above are just ideas. We will find out what works best over time.

Page 57: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Dynamic Load Balancing:

Page 78: A Strategy to Move to Exascale Building on Experience on Roadrunner with Multifluid PPM Paul Woodward University of Minnesota 8/19/11

Runs shown here all have A1 = 0 and no disturbance on the outer surface of the dense shell in the initial state.

All runs shown here have = 12 x

We show the density distribution in the region inside that circle at which we apply our boundary condition.

This circle begins with radius 13 and moves inward with time at the constant velocity 2.6

The disturbance for mode 5 has an initial amplitude of 1% of the wavelength.

The disturbance for mode 47 has an initial amplitude of 9.4% of the wavelength.