TRANSCRIPT
LLNL-PRES-668437
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
High Performance of Finite-Volume Methods through Increased Arithmetic Intensity
SIAM CSE 2015
J. Loffeld and J.A.F. Hittinger
3/17/2015
To get a high flop rate, you need high arithmetic intensity
[Roofline plot: Performance (Gflop/s) vs. Arithmetic Intensity (flop/byte), both on log scales. Ceilings for machine peak, no FMA, and no AVX; the machine-balance line marks the transition from bandwidth-bound to compute-bound, with greater concurrency raising the attainable ceiling. Low-order PDE stencils sit at low AI; FFTs and dense matrix multiply sit at high AI. Annotation: increasing the order of FV methods improves AI.]
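To make the roofline bound concrete, here is a minimal C sketch of the model it depicts: attainable performance is the lesser of machine peak and bandwidth times AI. The peak is the Vulcan figure quoted later in the talk; the bandwidth is our back-derivation from the 4.8 flop/byte balance, not a measured value.

#include <stdio.h>

/* Roofline bound: attainable Gflop/s is the lesser of machine peak and
 * memory bandwidth times arithmetic intensity. */
static double roofline_gflops(double peak_gflops, double bw_gbytes,
                              double ai_flop_per_byte) {
    double bw_bound = bw_gbytes * ai_flop_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void) {
    double peak = 205.0; /* Gflop/s, BG/Q node peak quoted later in the talk */
    double bw   = 42.7;  /* GB/s, assumed so that 205/42.7 ~ 4.8 flop/byte */
    for (double ai = 0.5; ai <= 32.0; ai *= 2.0)
        printf("AI = %5.2f flop/byte -> %6.1f Gflop/s\n",
               ai, roofline_gflops(peak, bw, ai));
    return 0;
}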
Higher-order finite-volume methods require high-order flux approximations
Update formula for conservation laws (a standard form is written below):
• We are considering AI for one time step
Approximating the flux averages gives the FV method
• High-order approximations give a high-order method
High-order flux approximations use more flops
[Figure: flux update stencil]
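As a point of reference for the counting, a standard fully discrete finite-volume update in $D$ dimensions, written in our own notation ($\langle u \rangle_{\mathbf{i}}$ is the cell average, $\langle F_d \rangle$ the face-averaged flux), takes the form

$$\langle u \rangle_{\mathbf{i}}^{n+1} = \langle u \rangle_{\mathbf{i}}^{n} - \frac{\Delta t}{\Delta x} \sum_{d=1}^{D} \left( \langle F_d \rangle_{\mathbf{i}+\frac{1}{2}\mathbf{e}_d} - \langle F_d \rangle_{\mathbf{i}-\frac{1}{2}\mathbf{e}_d} \right).$$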
High-order flux approximations include more neighbor information
Eighth-order central flux (written out below):
Incorporating information from neighbors gives a high flop count
Derived upwind and central high-order schemes for 5th through 8th order
[McCorquodale et al., CAMCoS (2011)], [Colella et al., J. Comput. Phys. (2011)]
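In one dimension, the standard eighth-order central interpolation of a face value from cell averages (the coefficients follow from differentiating the primitive function, consistent with the references above) is

$$\hat f_{j+1/2} = \frac{1}{840}\Big[\,533\big(\bar f_j + \bar f_{j+1}\big) - 139\big(\bar f_{j-1} + \bar f_{j+2}\big) + 29\big(\bar f_{j-2} + \bar f_{j+3}\big) - 3\big(\bar f_{j-3} + \bar f_{j+4}\big)\Big].$$

A minimal C sketch of this stencil (function and array names are ours) makes the flop growth concrete; grouped symmetrically it costs 5 multiplies and 7 adds per face:

/* Eighth-order central interpolation of face values from cell averages.
 * f[] points at the first interior cell average, with at least 3 ghost
 * cells allocated before f[0] and 4 after f[n-1]; face[j] approximates
 * the value at face j+1/2. */
void central8_face(const double *f, double *face, int n) {
    const double c = 1.0 / 840.0;
    for (int j = 0; j < n; ++j)
        face[j] = c * (533.0 * (f[j]     + f[j + 1])
                     - 139.0 * (f[j - 1] + f[j + 2])
                     +  29.0 * (f[j - 2] + f[j + 3])
                     -   3.0 * (f[j - 3] + f[j + 4]));
}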
AI is most easily calculated in the limit of infinite cache size
Assume unlimited cache
• Useful for later refinement
• Gives the target AI
Load and store data only once per cell
Temporaries between stencils absorbed by cache
Re-use of data allows high AI (a sketch of the counting follows)
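A minimal C sketch of that infinite-cache counting, under our own simplifying assumptions (8-byte doubles; each cell's state crosses the memory bus exactly once in each direction; all temporaries stay in cache):

/* Infinite-cache (target) arithmetic intensity: each cell is loaded and
 * stored exactly once, so traffic is 2 * ncomp * 8 bytes per cell.
 * A hedged sketch, not the talk's exact counting. */
double target_ai(double flops_per_cell, int ncomp) {
    double bytes_per_cell = 2.0 * ncomp * sizeof(double); /* load + store */
    return flops_per_cell / bytes_per_cell;
}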
Theoretical maximum AI reaches target for sixth and eighth order
[Plot: theoretical maximum AI by order, with the modern machine balance marked.]
We would see these results in practice if machines had infinite cache space
Flops for the example step: $\left( c\,\frac{8D-1}{2} + 1 \right) D\,(N+1)(N+2)^{D-1}$
Formulas are parameterized by
• Dimension ($D$)
• Number of components
• Domain size ($N$)
• Flop cost of the flux function ($c$)
In reality, machines have finite-size caches
Overhead comes from re-fetching halo cells
• Sixth-order halo width is 4
• Eighth-order halo width is 6
• Halo cells limit the minimum block size
Each block stores a halo-padded number of values per component (a count is sketched below)
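Assuming a cubic block of edge length $B$ with halo width $g$ on every side (our notation), each block stores $(B + 2g)^D$ values per component; at eighth order ($g = 6$) a $32^3$ block in 3D holds $(32 + 12)^3 \approx 8.5 \times 10^4$ values per component.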
To verify the predictions, we used the hardware counters on Vulcan
Vulcan – IBM Blue Gene/Q
• 32 MB L2 cache (last level)
• Cache line is 128 bytes
BGPM for hardware counters
• Flop counts are highly accurate
• DRAM transfers are overcounted:
— Turn off prefetching
— Overhead from the API
— Aliasing error from the large cache line
— Random noise
[Roofline for Vulcan: machine peak of 205 Gflop/s; machine balance at 4.8 flop/byte.]
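For reference, machine balance is the peak flop rate divided by sustained memory bandwidth, so the numbers above imply roughly $205 \div 4.8 \approx 43$ GB/s of DRAM bandwidth per node.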
Measured AI with ND cache blocking compares well to theory
• Higher-order methods have wider stencils
• Blocks need wide halos
• Less efficient cache reuse
[Plot: theoretical vs. measured AI for the fourth-, sixth-, and eighth-order methods, with the modern machine balance marked.]
Because of the halo, 3D blocking requires too much cache space
Need a block length of about 32 to keep overhead modest (on a 128³ domain)
• For eighth order, 1.55x
• For sixth order, 1.34x
Each block requires halo-padded storage (a worked count follows)
For a 5-component system (e.g. Euler), that means ~5 MB of cache per 32-wide block
• Current processors have ~2 to 2.5 MB/core
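A hedged check of that figure, assuming 8-byte doubles and the eighth-order halo width of 6: a 32-wide block stores $(32 + 2 \cdot 6)^3 \cdot 5 \cdot 8 \approx 3.4$ MB of state alone, and flux and stencil temporaries push the working set toward the quoted 5 MB.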
However, vertical iteration of rectangular cache blocks can improve cache usage
Successively evaluate blocks in columns (a loop sketch follows the table)
• No re-fetching of halo in the z direction
Storage per block is given in the table below.
For 8 × 32² blocks in a 128³ domain:

Order   Overhead   Size     AI
6       1.21x      1.5 MB   13.6
8       1.33x      2.1 MB   21.8

High AI with realistic cache size
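A minimal C sketch of the column-wise traversal, with hypothetical names (process_block is a stand-in for one full flux update on a block): because consecutive z-blocks in a column overlap in their z-halos, that shared slab is still cache-resident when the next block runs, so only the x/y halos are re-fetched.

#define NX 128
#define NY 128
#define NZ 128
#define BX 32
#define BY 32
#define BZ 8

/* Hypothetical stand-in for one flux update over the block
 * [x0, x0+bx) x [y0, y0+by) x [z0, z0+bz). */
void process_block(int x0, int y0, int z0, int bx, int by, int bz) {
    (void)x0; (void)y0; (void)z0; (void)bx; (void)by; (void)bz;
}

/* Outer loops pick a column in x/y; the inner loop walks vertically up
 * that column in z, so z-halo data is never re-fetched from DRAM. */
void sweep_columns(void) {
    for (int x0 = 0; x0 < NX; x0 += BX)
        for (int y0 = 0; y0 < NY; y0 += BY)
            for (int z0 = 0; z0 < NZ; z0 += BZ) /* vertical iteration */
                process_block(x0, y0, z0, BX, BY, BZ);
}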
Summary:
• Derived high-order finite-volume schemes
• Conducted an AI analysis showing that high AI can be obtained with realistic cache sizes

Current and future work:
• AI is an important metric for on-node utilization, but it does not equal performance
— Latency, concurrency, cache blocking
— [Olschanowsky et al., SC (2014)] for 4th order
• Need to consider ways to reduce halo width to further reduce overhead
• Include nonlinear limiting in the flux AI analysis
— Will further increase ops without increasing data transfers