TRANSCRIPT
LLNL-PRES-668437
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
High Performance of Finite-Volume Methods through Increased Arithmetic Intensity
SIAM CSE 2015
J. Loffeld and J.A.F. Hittinger
3/17/2015
To get a high flop rate, you need high arithmetic intensity
[Roofline plot: Performance (Gflop/s) vs. Arithmetic Intensity (flop/byte), both on log scales. Ceilings for machine peak, no FMA, and no AVX; the machine-balance line marks the transition from bandwidth-bound to compute-bound, with greater concurrency raising the attainable ceiling. Low-order PDE stencils sit at low AI; FFTs and dense matrix multiply sit at high AI. Annotation: increasing the order of FV methods improves AI.]
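To make the roofline bound concrete, here is a minimal C sketch of the model it depicts: attainable performance is the lesser of machine peak and bandwidth times AI. The peak is the Vulcan figure quoted later in the talk; the bandwidth is our back-derivation from the 4.8 flop/byte balance, not a measured value.

#include <stdio.h>

/* Roofline bound: attainable Gflop/s is the lesser of machine peak and
 * memory bandwidth times arithmetic intensity. */
static double roofline_gflops(double peak_gflops, double bw_gbytes,
                              double ai_flop_per_byte) {
    double bw_bound = bw_gbytes * ai_flop_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void) {
    double peak = 205.0; /* Gflop/s, BG/Q node peak quoted later in the talk */
    double bw   = 42.7;  /* GB/s, assumed so that 205/42.7 ~ 4.8 flop/byte */
    for (double ai = 0.5; ai <= 32.0; ai *= 2.0)
        printf("AI = %5.2f flop/byte -> %6.1f Gflop/s\n",
               ai, roofline_gflops(peak, bw, ai));
    return 0;
}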
Higher-order finite-volume methods require high-order flux approximations
Update formula for conservation laws (a standard form is written below):
• We are considering AI for one time step
Approximating the flux averages gives the FV method
• High-order approximations give a high-order method
High-order flux approximations use more flops
[Figure: flux update stencil]
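As a point of reference for the counting, a standard fully discrete finite-volume update in $D$ dimensions, written in our own notation ($\langle u \rangle_{\mathbf{i}}$ is the cell average, $\langle F_d \rangle$ the face-averaged flux), takes the form

$$\langle u \rangle_{\mathbf{i}}^{n+1} = \langle u \rangle_{\mathbf{i}}^{n} - \frac{\Delta t}{\Delta x} \sum_{d=1}^{D} \left( \langle F_d \rangle_{\mathbf{i}+\frac{1}{2}\mathbf{e}_d} - \langle F_d \rangle_{\mathbf{i}-\frac{1}{2}\mathbf{e}_d} \right).$$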
High-order flux approximations include more neighbor information
Eighth-order central flux (written out below):
Incorporating information from neighbors gives a high flop count
Derived upwind and central high-order schemes for 5th through 8th order
[McCorquodale et al., CAMCoS (2011)], [Colella et al., J. Comput. Phys. (2011)]
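In one dimension, the standard eighth-order central interpolation of a face value from cell averages (the coefficients follow from differentiating the primitive function, consistent with the references above) is

$$\hat f_{j+1/2} = \frac{1}{840}\Big[\,533\big(\bar f_j + \bar f_{j+1}\big) - 139\big(\bar f_{j-1} + \bar f_{j+2}\big) + 29\big(\bar f_{j-2} + \bar f_{j+3}\big) - 3\big(\bar f_{j-3} + \bar f_{j+4}\big)\Big].$$

A minimal C sketch of this stencil (function and array names are ours) makes the flop growth concrete; grouped symmetrically it costs 5 multiplies and 7 adds per face:

/* Eighth-order central interpolation of face values from cell averages.
 * f[] points at the first interior cell average, with at least 3 ghost
 * cells allocated before f[0] and 4 after f[n-1]; face[j] approximates
 * the value at face j+1/2. */
void central8_face(const double *f, double *face, int n) {
    const double c = 1.0 / 840.0;
    for (int j = 0; j < n; ++j)
        face[j] = c * (533.0 * (f[j]     + f[j + 1])
                     - 139.0 * (f[j - 1] + f[j + 2])
                     +  29.0 * (f[j - 2] + f[j + 3])
                     -   3.0 * (f[j - 3] + f[j + 4]));
}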
AI is most easily calculated in the limit of infinite cache size
Assume unlimited cache
• Useful for later refinement
• Gives the target AI
Load and store data only once per cell
Temporaries between stencils absorbed by cache
Re-use of data allows high AI (a sketch of the counting follows)
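A minimal C sketch of that infinite-cache counting, under our own simplifying assumptions (8-byte doubles; each cell's state crosses the memory bus exactly once in each direction; all temporaries stay in cache):

/* Infinite-cache (target) arithmetic intensity: each cell is loaded and
 * stored exactly once, so traffic is 2 * ncomp * 8 bytes per cell.
 * A hedged sketch, not the talk's exact counting. */
double target_ai(double flops_per_cell, int ncomp) {
    double bytes_per_cell = 2.0 * ncomp * sizeof(double); /* load + store */
    return flops_per_cell / bytes_per_cell;
}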
Theoretical maximum AI reaches target for sixth and eighth order
[Plot: theoretical maximum AI by order, with the modern machine balance marked.]
We would see these results in practice if machines had infinite cache space
Flops for the example step: $\left( c\,\frac{8D-1}{2} + 1 \right) D\,(N+1)(N+2)^{D-1}$
Formulas are parameterized by
• Dimension ($D$)
• Number of components
• Domain size ($N$)
• Flop cost of the flux function ($c$)
In reality, machines have finite-size caches
Overhead comes from re-fetching halo cells
• Sixth-order halo width is 4
• Eighth-order halo width is 6
• Halo cells limit the minimum block size
Each block stores a halo-padded number of values per component (a count is sketched below)
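Assuming a cubic block of edge length $B$ with halo width $g$ on every side (our notation), each block stores $(B + 2g)^D$ values per component; at eighth order ($g = 6$) a $32^3$ block in 3D holds $(32 + 12)^3 \approx 8.5 \times 10^4$ values per component.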
To verify the predictions, we used the hardware counters on Vulcan
Vulcan – IBM Blue Gene/Q
• 32 MB L2 cache (last level)
• Cache line is 128 bytes
BGPM for hardware counters
• Flop counts are highly accurate
• DRAM transfers are overcounted:
— Turn off prefetching
— Overhead from the API
— Aliasing error from the large cache line
— Random noise
[Roofline for Vulcan: machine peak of 205 Gflop/s; machine balance at 4.8 flop/byte.]
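For reference, machine balance is the peak flop rate divided by sustained memory bandwidth, so the numbers above imply roughly $205 \div 4.8 \approx 43$ GB/s of DRAM bandwidth per node.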
Measured AI with ND cache blocking compares well to theory
• Higher-order methods have wider stencils
• Blocks need wide halos
• Less efficient cache reuse
[Plot: theoretical vs. measured AI for the fourth-, sixth-, and eighth-order methods, with the modern machine balance marked.]
Because of the halo, 3D blocking requires too much cache space
Need a block length of about 32 to keep overhead modest (on a 128³ domain)
• For eighth order, 1.55x
• For sixth order, 1.34x
Each block requires halo-padded storage (a worked count follows)
For a 5-component system (e.g. Euler), that means ~5 MB of cache per 32-wide block
• Current processors have ~2 to 2.5 MB/core
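A hedged check of that figure, assuming 8-byte doubles and the eighth-order halo width of 6: a 32-wide block stores $(32 + 2 \cdot 6)^3 \cdot 5 \cdot 8 \approx 3.4$ MB of state alone, and flux and stencil temporaries push the working set toward the quoted 5 MB.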
However, vertical iteration of rectangular cache blocks can improve cache usage
Successively evaluate blocks in columns (a loop sketch follows the table)
• No re-fetching of halo in the z direction
Storage per block is given in the table below.
For 8 × 32² blocks in a 128³ domain:

Order   Overhead   Size     AI
6       1.21x      1.5 MB   13.6
8       1.33x      2.1 MB   21.8

High AI with realistic cache size
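A minimal C sketch of the column-wise traversal, with hypothetical names (process_block is a stand-in for one full flux update on a block): because consecutive z-blocks in a column overlap in their z-halos, that shared slab is still cache-resident when the next block runs, so only the x/y halos are re-fetched.

#define NX 128
#define NY 128
#define NZ 128
#define BX 32
#define BY 32
#define BZ 8

/* Hypothetical stand-in for one flux update over the block
 * [x0, x0+bx) x [y0, y0+by) x [z0, z0+bz). */
void process_block(int x0, int y0, int z0, int bx, int by, int bz) {
    (void)x0; (void)y0; (void)z0; (void)bx; (void)by; (void)bz;
}

/* Outer loops pick a column in x/y; the inner loop walks vertically up
 * that column in z, so z-halo data is never re-fetched from DRAM. */
void sweep_columns(void) {
    for (int x0 = 0; x0 < NX; x0 += BX)
        for (int y0 = 0; y0 < NY; y0 += BY)
            for (int z0 = 0; z0 < NZ; z0 += BZ) /* vertical iteration */
                process_block(x0, y0, z0, BX, BY, BZ);
}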
Summary:
• Derived high-order finite-volume schemes
• Conducted an AI analysis showing that high AI can be obtained with realistic cache sizes

Current and future work:
• AI is an important metric for on-node utilization, but it does not equal performance
— Latency, concurrency, cache blocking
— [Olschanowsky et al., SC (2014)] for 4th order
• Need to consider ways to reduce halo width to further reduce overhead
• Include nonlinear limiting in the flux AI analysis
— Will further increase ops without increasing data transfers