TRANSCRIPT
Comparative Performance Analysis of an Algebraic-Multigrid Solver on Leading Multicore Architectures
Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li, Osni A. Marques, Eric Roman, Samuel Williams
Lawrence Berkeley National Laboratory
Andrew Barker, Panayot Vassilevski
Lawrence Livermore National Laboratory
Delyan Kalchev
University of Colorado, Boulder
What this talk is about
• performance optimization, comparison, and modeling of
• a novel shared-memory algebraic-multigrid solver
• using the SPE10 reservoir-modeling problem
• on a node of a Cray XC30 and on a Xeon Phi.
How our multigrid solver works
Repeat until converged:
pre-smoothing:           y ← x + M⁻¹(b − Ax)
coarse-grid correction:  z ← y + P Ac⁻¹ Pᵀ(b − Ay)
post-smoothing:          x ← z + M⁻¹(b − Az)
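A minimal C sketch of one such cycle, with every operator passed as a callback; the names (apply_A, apply_M, solve_Ac, and so on) are illustrative, not the solver's actual API, and storage details are deliberately left abstract:

#include <stddef.h>

/* One cycle of the method above, written against abstract operators.
   Callback roles (all names are hypothetical):
     apply_A   : w <- A v        (fine-grid operator, length n)
     apply_M   : w <- M^{-1} v   (smoother, length n)
     restrict_ : w <- P^T v      (length n -> length nc)
     prolong   : w <- P v        (length nc -> length n)
     solve_Ac  : w <- Ac^{-1} v  (coarse solve, length nc)           */
typedef void (*op_fn)(const double *v, double *w, void *ctx);

static void residual(op_fn apply_A, void *ctx,
                     const double *b, const double *x,
                     double *r, size_t n)
{
    apply_A(x, r, ctx);                          /* r <- A x     */
    for (size_t i = 0; i < n; i++)
        r[i] = b[i] - r[i];                      /* r <- b - A x */
}

void amg_cycle(op_fn apply_A, op_fn apply_M, op_fn restrict_,
               op_fn prolong, op_fn solve_Ac, void *ctx,
               const double *b, double *x, size_t n,
               double *r, double *c,             /* scratch, length n  */
               double *rc, double *zc)           /* scratch, length nc */
{
    /* pre-smoothing: x <- x + M^{-1}(b - A x) */
    residual(apply_A, ctx, b, x, r, n);
    apply_M(r, c, ctx);
    for (size_t i = 0; i < n; i++) x[i] += c[i];

    /* coarse-grid correction: x <- x + P Ac^{-1} P^T (b - A x) */
    residual(apply_A, ctx, b, x, r, n);
    restrict_(r, rc, ctx);
    solve_Ac(rc, zc, ctx);
    prolong(zc, c, ctx);
    for (size_t i = 0; i < n; i++) x[i] += c[i];

    /* post-smoothing: x <- x + M^{-1}(b - A x) */
    residual(apply_A, ctx, b, x, r, n);
    apply_M(r, c, ctx);
    for (size_t i = 0; i < n; i++) x[i] += c[i];
}

The slide's y and z are folded into in-place updates of x; the arithmetic is identical.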
What the SPE10 problem is and how we are solving it
Credit: http://www.spe10.org
• oil-reservoir modeling benchmark problem
• solved using Darcy's equation (in primal form)
      −∇·(κ(x)∇p(x)) = f(x),
  where p(x) = pressure and κ(x) = permeability
• defined over a 60 × 220 × 85 grid
• with isotropic and anisotropic versions
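To make the operator concrete, here is a matrix-free sketch of −∇·(κ∇p) on the 60 × 220 × 85 grid, assuming a standard 7-point finite-volume stencil with harmonic averaging of κ at cell faces, unit cell widths, and no-flow boundaries; the talk does not spell out these discretization choices, so treat them as assumptions:

#include <stddef.h>

enum { NX = 60, NY = 220, NZ = 85 };            /* SPE10 grid */
#define IDX(i, j, k) ((size_t)(i) + NX * ((size_t)(j) + (size_t)NY * (k)))

/* Harmonic average of kappa across the face between two cells. */
static double face(double ka, double kb) { return 2.0 * ka * kb / (ka + kb); }

static const int d[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};

/* Matrix-free y <- A x for -div(kappa grad p), with skipped faces
   acting as no-flow (homogeneous Neumann) boundaries. A sketch only:
   the real benchmark prescribes its own cell sizes and boundaries. */
void apply_darcy(const double *kappa, const double *x, double *y)
{
    for (int k = 0; k < NZ; k++)
    for (int j = 0; j < NY; j++)
    for (int i = 0; i < NX; i++) {
        size_t c = IDX(i, j, k);
        double diag = 0.0, off = 0.0;
        for (int m = 0; m < 6; m++) {
            int ii = i + d[m][0], jj = j + d[m][1], kk = k + d[m][2];
            if (ii < 0 || ii >= NX || jj < 0 || jj >= NY || kk < 0 || kk >= NZ)
                continue;                        /* no flux through boundary */
            double kf = face(kappa[c], kappa[IDX(ii, jj, kk)]);
            diag += kf;
            off  += kf * x[IDX(ii, jj, kk)];
        }
        y[c] = diag * x[c] - off;
    }
}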
What are the machines that we study?
                   Edison             Babbage
name               Ivy Bridge         Knights Corner
model              Xeon E5-2695 v2    Xeon Phi 5110P
clock speed        2.4 GHz            1.053 GHz
cores              12                 60
SMT threads        2                  4
SIMD width         4                  8
peak gflop/s       230.4              1,010.88
bandwidth          48.5 GB/s          122.9 GB/s
per-core caches:
  L1-D             32 KB              32 KB
  L2               256 KB             512 KB
shared cache:
  L3               30 MB              none
How we chose the preconditioner for PCG
preconditioner            operator
Jacobi                    z = D⁻¹r
symmetric Gauss–Seidel    z = (L + D)⁻¹ D (L + D)⁻ᵀ r

where Ac = L + D + Lᵀ (L strictly lower triangular, D diagonal).
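A sketch of how the two preconditioners apply to a residual vector r, assuming Ac is stored in CSR format with its diagonal extracted into diag[]; the struct and function names are illustrative:

#include <stddef.h>

/* CSR matrix: row pointers ia, column indices ja, values a. */
typedef struct { int n; const int *ia, *ja; const double *a; } csr;

/* Jacobi: z <- D^{-1} r. */
void jacobi_apply(int n, const double *diag, const double *r, double *z)
{
    for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];
}

/* Symmetric Gauss-Seidel: z <- (L+D)^{-1} D (L+D)^{-T} r, realized as
   a backward solve, a diagonal scaling, and a forward solve on Ac. */
void sgs_apply(const csr *A, const double *diag, const double *r,
               double *z, double *t /* scratch, length n */)
{
    int n = A->n;
    /* t <- (L+D)^{-T} r = (D+U)^{-1} r : backward substitution */
    for (int i = n - 1; i >= 0; i--) {
        double s = r[i];
        for (int q = A->ia[i]; q < A->ia[i + 1]; q++)
            if (A->ja[q] > i) s -= A->a[q] * t[A->ja[q]];
        t[i] = s / diag[i];
    }
    /* t <- D t */
    for (int i = 0; i < n; i++) t[i] *= diag[i];
    /* z <- (L+D)^{-1} t : forward substitution */
    for (int i = 0; i < n; i++) {
        double s = t[i];
        for (int q = A->ia[i]; q < A->ia[i + 1]; q++)
            if (A->ja[q] < i) s -= A->a[q] * z[A->ja[q]];
        z[i] = s / diag[i];
    }
}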
How we chose the preconditioner for PCG
               unpreconditioned   Jacobi       SGS
conditioning
  isotropic    3.37 × 10⁴         1.35 × 10³   1.83 × 10²
  anisotropic  9.68 × 10⁶         1.89 × 10⁴   2.91 × 10³
iterations
  isotropic    605.53             194.57       78.87
  anisotropic  1,267.85           288.32       122.85
How we chose the preconditioner for PCG
               SGS, 1 thread   Jacobi, 1 thread   Jacobi, 12 threads
time (s)
  isotropic    83.0            80.3               29.2
  anisotropic  128.6           121.6              43.8
Where does the AMG cycle spend most of its time?
[Plot: runtime (s) vs. number of threads (1–120), log–log axes, with curves for smoothing, PCG, and total runtime.]
How to improve the performance of PCG
Algorithm 1
 1: while not converged do
 2:   ρ ← σ
 3:   omp parallel for: w ← Ap
 4:   omp parallel for: τ ← w · p
 5:   α ← ρ/τ
 6:   omp parallel for: x ← x + αp
 7:   omp parallel for: r ← r − αw
 8:   omp parallel for: z ← M⁻¹r
 9:   omp parallel for: σ ← z · r
10:   β ← σ/ρ
11:   omp parallel for: p ← z + βp
12: end while
How to improve the performance of PCG
Algorithm 2
 1: omp parallel
 2: while not converged do
 3:   omp single: τ ← 0.0              ▷ implied barrier
 4:   omp single nowait: ρ ← σ, σ ← 0.0
 5:   omp for nowait: w ← Ap
 6:   omp for reduction: τ ← w · p     ▷ implied barrier
 7:   α ← ρ/τ
 8:   omp for nowait: x ← x + αp
 9:   omp for nowait: r ← r − αw
10:   omp for nowait: z ← M⁻¹r
11:   omp for reduction: σ ← z · r     ▷ implied barrier
12:   β ← σ/ρ
13:   omp for nowait: p ← z + βp
14: end while
15: end omp parallel
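In C/OpenMP, the key change from Algorithm 1 to Algorithm 2 looks roughly like the sketch below: one persistent parallel region, nowait on loops whose output the next statement does not read, and the two reductions supplying the only barriers per iteration. A CSR matrix and a Jacobi diagonal stand in for the solver's actual operator and M⁻¹, and setup and the convergence test are omitted; schedule(static) is what makes the nowait chains safe, since consecutive loops of the same length then assign identical index chunks to each thread:

#include <omp.h>

static double rho, sigma, tau;          /* shared across the team */

void pcg_iterations(int n, const int *ia, const int *ja, const double *a,
                    const double *diag, int maxit,
                    double *x, double *r, double *z, double *p, double *w)
{
    #pragma omp parallel
    for (int it = 0; it < maxit; it++) {
        #pragma omp single
        tau = 0.0;                                   /* implied barrier */
        #pragma omp single nowait
        { rho = sigma; sigma = 0.0; }
        #pragma omp for schedule(static) nowait      /* w <- A p */
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int q = ia[i]; q < ia[i + 1]; q++)
                s += a[q] * p[ja[q]];
            w[i] = s;
        }
        #pragma omp for schedule(static) reduction(+:tau) /* barrier */
        for (int i = 0; i < n; i++) tau += w[i] * p[i];
        double alpha = rho / tau;   /* private; recomputed by each thread */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) x[i] += alpha * p[i];
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) r[i] -= alpha * w[i];
        #pragma omp for schedule(static) nowait      /* z <- M^{-1} r */
        for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];
        #pragma omp for schedule(static) reduction(+:sigma) /* barrier */
        for (int i = 0; i < n; i++) sigma += z[i] * r[i];
        double beta = sigma / rho;
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
}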
How to improve the performance of PCG
Algorithm 3
 1: omp parallel
 2: while not converged do
 3:   omp for: w ← Ap
 4:   omp single
 5:     τ ← w · p
 6:     α ← ρ/τ
 7:     x ← x + αp
 8:     r ← r − αw
 9:     z ← M⁻¹r
10:     ρ ← σ
11:     σ ← z · r
12:     β ← σ/ρ
13:     p ← z + βp
14:   end omp single
15: end while
16: end omp parallel
How to improve the performance of PCG
Algorithm 4
 1: while not converged do
 2:   ρ ← σ
 3:   omp parallel for: w ← Ap
 4:   τ ← w · p
 5:   α ← ρ/τ
 6:   x ← x + αp
 7:   r ← r − αw
 8:   z ← M⁻¹r
 9:   σ ← z · r
10:   β ← σ/ρ
11:   p ← z + βp
12: end while
How to improve the performance of PCG
[Plot: runtime (s) vs. number of threads (1–120), log–log axes, comparing Algorithms 1–4.]
How the sparse HSS solver works
• sparse matrix-factorization algorithm
• represents the frontal matrices as hierarchically semiseparable (HSS) matrices
• uses randomized sampling for faster compression
[Figure: HSS block partitioning, with dense diagonal blocks D1–D12 at the leaves and off-diagonal blocks compressed as low-rank products such as U3 B3 V6^H, U6 B6 V3^H, and B14, with nested bases, e.g. U7 = [U3 R3; U6 R6].]
More details in Pieter Ghysels’ talk tomorrow!
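As a toy illustration of the randomized-sampling idea (not the solver's actual compression kernel): to capture the range of an m × n block A at rank ≈ k, multiply A by a random Gaussian matrix Ω with k + p columns and keep the sample S = AΩ; with modest oversampling p, the columns of S span A's numerical range with high probability. The dense loops and the crude Gaussian generator below are purely illustrative; the real solver obtains such samples through fast sparse multiplications:

#include <stdlib.h>

static double gauss(void)               /* crude normal via the CLT */
{
    double s = 0.0;
    for (int i = 0; i < 12; i++) s += (double)rand() / RAND_MAX;
    return s - 6.0;
}

/* Fill Omega (n x (k+p)) with Gaussian entries and form the sample
   S = A * Omega (m x (k+p)); A is row-major dense for this sketch. */
void sample_range(int m, int n, int k, int p,
                  const double *A, double *Omega, double *S)
{
    int d = k + p;
    for (int i = 0; i < n * d; i++) Omega[i] = gauss();
    for (int i = 0; i < m; i++)
        for (int j = 0; j < d; j++) {
            double s = 0.0;
            for (int l = 0; l < n; l++)
                s += A[i * n + l] * Omega[l * d + j];
            S[i * d + j] = s;
        }
}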
How do the parameters of the solver affect performance?
Parameter                  Values
coarse solver              HSS, PCG
elements per agglomerate   64, 128, 256, 512
ν_P                        0, 1, 2
ν_{M⁻¹}                    1, 3, 5
θ                          0.001, 0.001 × 10^0.5, 0.01
How do the parameters of the solver affect performance?
[Plot: runtime (s) vs. percentile rank (1%–64%) of parameter configurations, with curves for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG); the default configuration is marked.]
What our performance model is
stage                    bytes                         flops
pre- and post-smooth     (3ν + 1)(12 nza + 3 · 8n)     2(3ν + 1)(nza + 2n)
restriction              12 nza + 12 nzp + 3 · 8n      2(nza + nzp)
one coarse solve:
  multiply by Ac         12 nzc                        2 nzc
  preconditioner         2 · 8 nc                      nc
  vector operations      5 · 8 nc                      2 · 5 nc
interpolation            12 nzp + 8n                   2 nzp
stopping criterion       12 nza + 4 · 8n               2(nza + n)
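A small sketch of how a model like this turns byte and flop counts into runtimes: each stage needs at least bytes/bandwidth seconds and flops/peak seconds, and the predicted time is the larger of the two. The machine numbers are Edison's from the table earlier; the problem sizes (n from the SPE10 grid, nza a guessed nonzero count) are illustrative assumptions:

#include <stdio.h>

#define BW   48.5e9     /* Edison node bandwidth, bytes/s */
#define PEAK 230.4e9    /* Edison node peak, flop/s       */

/* Roofline-style bound: max of memory time and compute time. */
static double stage_time(double bytes, double flops)
{
    double t_mem = bytes / BW, t_flop = flops / PEAK;
    return t_mem > t_flop ? t_mem : t_flop;
}

int main(void)
{
    /* Smoothing stage from the table; n = 60*220*85, nza assumed. */
    double n = 1.122e6, nza = 7.5e6, nu = 1.0;
    double bytes = (3*nu + 1) * (12*nza + 3*8*n);
    double flops = 2 * (3*nu + 1) * (nza + 2*n);
    printf("smoothing: >= %.3g s per cycle (memory bound: %s)\n",
           stage_time(bytes, flops),
           bytes / BW > flops / PEAK ? "yes" : "no");
    return 0;
}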
What our performance model is
[Plot: runtime (s) vs. number of cores (1–12), log–log axes, comparing the memory-bound model, the flops-bound model, and actual measurements.]
Final comments
• HSS is an attractive option for solving coarse systems
• performance is quite sensitive to parameter tuning
• the performance model indicates where the bottlenecks are