TRANSCRIPT
Comparative Performance Analysis of an Algebraic-Multigrid Solver on Leading Multicore Architectures
Brian Austin, Alex Druinsky, Pieter Ghysels, Xiaoye Sherry Li, Osni A. Marques, Eric Roman, Samuel Williams
Lawrence Berkeley National Laboratory
Andrew Barker, Panayot Vassilevski
Lawrence Livermore National Laboratory
Delyan Kalchev
University of Colorado, Boulder
What this talk is about
• performance optimization, comparison, and modeling of
• a novel shared-memory algebraic-multigrid solver
• using the SPE10 reservoir-modeling problem
• on a node of a Cray XC30 and on a Xeon Phi.
How our multigrid solver works
Repeat until converged:
pre-smoothing:           y ← x + M⁻¹(b − Ax)
coarse-grid correction:  z ← y + P Ac⁻¹ Pᵀ(b − Ay)
post-smoothing:          x ← z + M⁻¹(b − Az)
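A minimal C sketch of one such cycle, with every operator passed as a callback; the names (apply_A, apply_M, solve_Ac, and so on) are illustrative, not the solver's actual API, and storage details are deliberately left abstract:

#include <stddef.h>

/* One cycle of the method above, written against abstract operators.
   Callback roles (all names are hypothetical):
     apply_A   : w <- A v        (fine-grid operator, length n)
     apply_M   : w <- M^{-1} v   (smoother, length n)
     restrict_ : w <- P^T v      (length n -> length nc)
     prolong   : w <- P v        (length nc -> length n)
     solve_Ac  : w <- Ac^{-1} v  (coarse solve, length nc)           */
typedef void (*op_fn)(const double *v, double *w, void *ctx);

static void residual(op_fn apply_A, void *ctx,
                     const double *b, const double *x,
                     double *r, size_t n)
{
    apply_A(x, r, ctx);                          /* r <- A x     */
    for (size_t i = 0; i < n; i++)
        r[i] = b[i] - r[i];                      /* r <- b - A x */
}

void amg_cycle(op_fn apply_A, op_fn apply_M, op_fn restrict_,
               op_fn prolong, op_fn solve_Ac, void *ctx,
               const double *b, double *x, size_t n,
               double *r, double *c,             /* scratch, length n  */
               double *rc, double *zc)           /* scratch, length nc */
{
    /* pre-smoothing: x <- x + M^{-1}(b - A x) */
    residual(apply_A, ctx, b, x, r, n);
    apply_M(r, c, ctx);
    for (size_t i = 0; i < n; i++) x[i] += c[i];

    /* coarse-grid correction: x <- x + P Ac^{-1} P^T (b - A x) */
    residual(apply_A, ctx, b, x, r, n);
    restrict_(r, rc, ctx);
    solve_Ac(rc, zc, ctx);
    prolong(zc, c, ctx);
    for (size_t i = 0; i < n; i++) x[i] += c[i];

    /* post-smoothing: x <- x + M^{-1}(b - A x) */
    residual(apply_A, ctx, b, x, r, n);
    apply_M(r, c, ctx);
    for (size_t i = 0; i < n; i++) x[i] += c[i];
}

The slide's y and z are folded into in-place updates of x; the arithmetic is identical.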
What the SPE10 problem is and how we are solving it
Credit: http://www.spe10.org
• oil-reservoir modeling benchmark problem
• solved using Darcy's equation (in primal form)
      −∇·(κ(x)∇p(x)) = f(x),
  where p(x) = pressure and κ(x) = permeability
• defined over a 60 × 220 × 85 grid
• with isotropic and anisotropic versions
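To make the operator concrete, here is a matrix-free sketch of −∇·(κ∇p) on the 60 × 220 × 85 grid, assuming a standard 7-point finite-volume stencil with harmonic averaging of κ at cell faces, unit cell widths, and no-flow boundaries; the talk does not spell out these discretization choices, so treat them as assumptions:

#include <stddef.h>

enum { NX = 60, NY = 220, NZ = 85 };            /* SPE10 grid */
#define IDX(i, j, k) ((size_t)(i) + NX * ((size_t)(j) + (size_t)NY * (k)))

/* Harmonic average of kappa across the face between two cells. */
static double face(double ka, double kb) { return 2.0 * ka * kb / (ka + kb); }

static const int d[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};

/* Matrix-free y <- A x for -div(kappa grad p), with skipped faces
   acting as no-flow (homogeneous Neumann) boundaries. A sketch only:
   the real benchmark prescribes its own cell sizes and boundaries. */
void apply_darcy(const double *kappa, const double *x, double *y)
{
    for (int k = 0; k < NZ; k++)
    for (int j = 0; j < NY; j++)
    for (int i = 0; i < NX; i++) {
        size_t c = IDX(i, j, k);
        double diag = 0.0, off = 0.0;
        for (int m = 0; m < 6; m++) {
            int ii = i + d[m][0], jj = j + d[m][1], kk = k + d[m][2];
            if (ii < 0 || ii >= NX || jj < 0 || jj >= NY || kk < 0 || kk >= NZ)
                continue;                        /* no flux through boundary */
            double kf = face(kappa[c], kappa[IDX(ii, jj, kk)]);
            diag += kf;
            off  += kf * x[IDX(ii, jj, kk)];
        }
        y[c] = diag * x[c] - off;
    }
}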
What are the machines that we study?
                   Edison             Babbage
name               Ivy Bridge         Knights Corner
model              Xeon E5-2695 v2    Xeon Phi 5110P
clock speed        2.4 GHz            1.053 GHz
cores              12                 60
SMT threads        2                  4
SIMD width         4                  8
peak gflop/s       230.4              1,010.88
bandwidth          48.5 GB/s          122.9 GB/s
per-core caches:
  L1-D             32 KB              32 KB
  L2               256 KB             512 KB
shared cache:
  L3               30 MB              none
How we chose the preconditioner for PCG
preconditioner            operator
Jacobi                    z = D⁻¹r
symmetric Gauss–Seidel    z = (L + D)⁻¹ D (L + D)⁻ᵀ r

where Ac = L + D + Lᵀ (L strictly lower triangular, D diagonal).
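A sketch of how the two preconditioners apply to a residual vector r, assuming Ac is stored in CSR format with its diagonal extracted into diag[]; the struct and function names are illustrative:

#include <stddef.h>

/* CSR matrix: row pointers ia, column indices ja, values a. */
typedef struct { int n; const int *ia, *ja; const double *a; } csr;

/* Jacobi: z <- D^{-1} r. */
void jacobi_apply(int n, const double *diag, const double *r, double *z)
{
    for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];
}

/* Symmetric Gauss-Seidel: z <- (L+D)^{-1} D (L+D)^{-T} r, realized as
   a backward solve, a diagonal scaling, and a forward solve on Ac. */
void sgs_apply(const csr *A, const double *diag, const double *r,
               double *z, double *t /* scratch, length n */)
{
    int n = A->n;
    /* t <- (L+D)^{-T} r = (D+U)^{-1} r : backward substitution */
    for (int i = n - 1; i >= 0; i--) {
        double s = r[i];
        for (int q = A->ia[i]; q < A->ia[i + 1]; q++)
            if (A->ja[q] > i) s -= A->a[q] * t[A->ja[q]];
        t[i] = s / diag[i];
    }
    /* t <- D t */
    for (int i = 0; i < n; i++) t[i] *= diag[i];
    /* z <- (L+D)^{-1} t : forward substitution */
    for (int i = 0; i < n; i++) {
        double s = t[i];
        for (int q = A->ia[i]; q < A->ia[i + 1]; q++)
            if (A->ja[q] < i) s -= A->a[q] * z[A->ja[q]];
        z[i] = s / diag[i];
    }
}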
How we chose the preconditioner for PCG
               unpreconditioned   Jacobi       SGS
conditioning
  isotropic    3.37 × 10⁴         1.35 × 10³   1.83 × 10²
  anisotropic  9.68 × 10⁶         1.89 × 10⁴   2.91 × 10³
iterations
  isotropic    605.53             194.57       78.87
  anisotropic  1,267.85           288.32       122.85
How we chose the preconditioner for PCG
               SGS, 1 thread   Jacobi, 1 thread   Jacobi, 12 threads
time (s)
  isotropic    83.0            80.3               29.2
  anisotropic  128.6           121.6              43.8
Where does the AMG cycle spend most of its time?
[Plot: runtime (s) vs. number of threads (1–120), log–log axes, with curves for smoothing, PCG, and total runtime.]
How to improve the performance of PCG
Algorithm 1
 1: while not converged do
 2:   ρ ← σ
 3:   omp parallel for: w ← Ap
 4:   omp parallel for: τ ← w · p
 5:   α ← ρ/τ
 6:   omp parallel for: x ← x + αp
 7:   omp parallel for: r ← r − αw
 8:   omp parallel for: z ← M⁻¹r
 9:   omp parallel for: σ ← z · r
10:   β ← σ/ρ
11:   omp parallel for: p ← z + βp
12: end while
How to improve the performance of PCG
Algorithm 2
 1: omp parallel
 2: while not converged do
 3:   omp single: τ ← 0.0              ▷ implied barrier
 4:   omp single nowait: ρ ← σ, σ ← 0.0
 5:   omp for nowait: w ← Ap
 6:   omp for reduction: τ ← w · p     ▷ implied barrier
 7:   α ← ρ/τ
 8:   omp for nowait: x ← x + αp
 9:   omp for nowait: r ← r − αw
10:   omp for nowait: z ← M⁻¹r
11:   omp for reduction: σ ← z · r     ▷ implied barrier
12:   β ← σ/ρ
13:   omp for nowait: p ← z + βp
14: end while
15: end omp parallel
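In C/OpenMP, the key change from Algorithm 1 to Algorithm 2 looks roughly like the sketch below: one persistent parallel region, nowait on loops whose output the next statement does not read, and the two reductions supplying the only barriers per iteration. A CSR matrix and a Jacobi diagonal stand in for the solver's actual operator and M⁻¹, and setup and the convergence test are omitted; schedule(static) is what makes the nowait chains safe, since consecutive loops of the same length then assign identical index chunks to each thread:

#include <omp.h>

static double rho, sigma, tau;          /* shared across the team */

void pcg_iterations(int n, const int *ia, const int *ja, const double *a,
                    const double *diag, int maxit,
                    double *x, double *r, double *z, double *p, double *w)
{
    #pragma omp parallel
    for (int it = 0; it < maxit; it++) {
        #pragma omp single
        tau = 0.0;                                   /* implied barrier */
        #pragma omp single nowait
        { rho = sigma; sigma = 0.0; }
        #pragma omp for schedule(static) nowait      /* w <- A p */
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int q = ia[i]; q < ia[i + 1]; q++)
                s += a[q] * p[ja[q]];
            w[i] = s;
        }
        #pragma omp for schedule(static) reduction(+:tau) /* barrier */
        for (int i = 0; i < n; i++) tau += w[i] * p[i];
        double alpha = rho / tau;   /* private; recomputed by each thread */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) x[i] += alpha * p[i];
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) r[i] -= alpha * w[i];
        #pragma omp for schedule(static) nowait      /* z <- M^{-1} r */
        for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];
        #pragma omp for schedule(static) reduction(+:sigma) /* barrier */
        for (int i = 0; i < n; i++) sigma += z[i] * r[i];
        double beta = sigma / rho;
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
}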
How to improve the performance of PCG
Algorithm 3
 1: omp parallel
 2: while not converged do
 3:   omp for: w ← Ap
 4:   omp single
 5:     τ ← w · p
 6:     α ← ρ/τ
 7:     x ← x + αp
 8:     r ← r − αw
 9:     z ← M⁻¹r
10:     ρ ← σ
11:     σ ← z · r
12:     β ← σ/ρ
13:     p ← z + βp
14:   end omp single
15: end while
16: end omp parallel
How to improve the performance of PCG
Algorithm 4
 1: while not converged do
 2:   ρ ← σ
 3:   omp parallel for: w ← Ap
 4:   τ ← w · p
 5:   α ← ρ/τ
 6:   x ← x + αp
 7:   r ← r − αw
 8:   z ← M⁻¹r
 9:   σ ← z · r
10:   β ← σ/ρ
11:   p ← z + βp
12: end while
How to improve the performance of PCG
[Plot: runtime (s) vs. number of threads (1–120), log–log axes, comparing Algorithms 1–4.]
How the sparse HSS solver works
• sparse matrix-factorization algorithm
• represents the frontal matrices as hierarchically semiseparable (HSS) matrices
• uses randomized sampling for faster compression
[Figure: HSS block partitioning, with dense diagonal blocks D1–D12 at the leaves and off-diagonal blocks compressed as low-rank products such as U3 B3 V6^H, U6 B6 V3^H, and B14, with nested bases, e.g. U7 = [U3 R3; U6 R6].]
More details in Pieter Ghysels’ talk tomorrow!
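As a toy illustration of the randomized-sampling idea (not the solver's actual compression kernel): to capture the range of an m × n block A at rank ≈ k, multiply A by a random Gaussian matrix Ω with k + p columns and keep the sample S = AΩ; with modest oversampling p, the columns of S span A's numerical range with high probability. The dense loops and the crude Gaussian generator below are purely illustrative; the real solver obtains such samples through fast sparse multiplications:

#include <stdlib.h>

static double gauss(void)               /* crude normal via the CLT */
{
    double s = 0.0;
    for (int i = 0; i < 12; i++) s += (double)rand() / RAND_MAX;
    return s - 6.0;
}

/* Fill Omega (n x (k+p)) with Gaussian entries and form the sample
   S = A * Omega (m x (k+p)); A is row-major dense for this sketch. */
void sample_range(int m, int n, int k, int p,
                  const double *A, double *Omega, double *S)
{
    int d = k + p;
    for (int i = 0; i < n * d; i++) Omega[i] = gauss();
    for (int i = 0; i < m; i++)
        for (int j = 0; j < d; j++) {
            double s = 0.0;
            for (int l = 0; l < n; l++)
                s += A[i * n + l] * Omega[l * d + j];
            S[i * d + j] = s;
        }
}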
How do the parameters of the solver affect performance?
Parameter                  Values
coarse solver              HSS, PCG
elements per agglomerate   64, 128, 256, 512
ν_P                        0, 1, 2
ν_{M⁻¹}                    1, 3, 5
θ                          0.001, 0.001 × 10^0.5, 0.01
How do the parameters of the solver affect performance?
[Plot: runtime (s) vs. percentile rank (1%–64%) of parameter configurations, with curves for Babbage (HSS), Babbage (PCG), Edison (HSS), and Edison (PCG); the default configuration is marked.]
What our performance model is
stage                    bytes                         flops
pre- and post-smooth     (3ν + 1)(12 nza + 3 · 8n)     2(3ν + 1)(nza + 2n)
restriction              12 nza + 12 nzp + 3 · 8n      2(nza + nzp)
one coarse solve:
  multiply by Ac         12 nzc                        2 nzc
  preconditioner         2 · 8 nc                      nc
  vector operations      5 · 8 nc                      2 · 5 nc
interpolation            12 nzp + 8n                   2 nzp
stopping criterion       12 nza + 4 · 8n               2(nza + n)
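A small sketch of how a model like this turns byte and flop counts into runtimes: each stage needs at least bytes/bandwidth seconds and flops/peak seconds, and the predicted time is the larger of the two. The machine numbers are Edison's from the table earlier; the problem sizes (n from the SPE10 grid, nza a guessed nonzero count) are illustrative assumptions:

#include <stdio.h>

#define BW   48.5e9     /* Edison node bandwidth, bytes/s */
#define PEAK 230.4e9    /* Edison node peak, flop/s       */

/* Roofline-style bound: max of memory time and compute time. */
static double stage_time(double bytes, double flops)
{
    double t_mem = bytes / BW, t_flop = flops / PEAK;
    return t_mem > t_flop ? t_mem : t_flop;
}

int main(void)
{
    /* Smoothing stage from the table; n = 60*220*85, nza assumed. */
    double n = 1.122e6, nza = 7.5e6, nu = 1.0;
    double bytes = (3*nu + 1) * (12*nza + 3*8*n);
    double flops = 2 * (3*nu + 1) * (nza + 2*n);
    printf("smoothing: >= %.3g s per cycle (memory bound: %s)\n",
           stage_time(bytes, flops),
           bytes / BW > flops / PEAK ? "yes" : "no");
    return 0;
}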
What our performance model is
[Plot: runtime (s) vs. number of cores (1–12), log–log axes, comparing the memory-bound model, the flops-bound model, and actual measurements.]
Final comments
• HSS is an attractive option for solving coarse systems
• performance is quite sensitive to parameter tuning
• the performance model indicates where the bottlenecks are