
Abdulrahman Manea

PhD Student

Hamdi Tchelepi

Associate Professor, Co-Director, Center for Computational Earth and Environmental Science

Energy Resources Engineering Department

School of Earth Sciences, Stanford University

1

Introduction

Background

2D Black Box Geometric MG (GMG)

3D Semicoarsening Multigrid

Future Work

2

Reservoir Simulation (Black Oil):

Mass Conservation of Component α:

Incompressible:

Total Balance:

Incompressible Pressure Equation:
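As a hedged reference for the four headings above, a commonly used simplified form is sketched below (one component per phase, gravity and capillary pressure neglected). The symbols $\phi$, $\rho_\alpha$, $S_\alpha$, $\mathbf{u}_\alpha$, $K$, $k_{r\alpha}$, $\mu_\alpha$ and $q_t$ are assumptions introduced for illustration; only the final form $\nabla \cdot (\lambda \nabla p) = q$ follows the notation used later in this deck.

```latex
% Mass conservation of component \alpha, with Darcy velocity u_\alpha (assumed form):
\[
\frac{\partial (\phi \rho_\alpha S_\alpha)}{\partial t}
  + \nabla \cdot (\rho_\alpha \mathbf{u}_\alpha) = \rho_\alpha q_\alpha,
\qquad
\mathbf{u}_\alpha = -\lambda_\alpha \nabla p,
\qquad
\lambda_\alpha = \frac{K k_{r\alpha}}{\mu_\alpha}
\]
% Incompressible (\phi and \rho_\alpha constant):
\[
\phi \frac{\partial S_\alpha}{\partial t} + \nabla \cdot \mathbf{u}_\alpha = q_\alpha
\]
% Total balance (sum over \alpha, using \sum_\alpha S_\alpha = 1):
\[
\nabla \cdot \mathbf{u}_t = q_t,
\qquad
\mathbf{u}_t = \sum_\alpha \mathbf{u}_\alpha = -\lambda \nabla p,
\qquad
\lambda = \sum_\alpha \lambda_\alpha
\]
% Incompressible pressure equation (sign absorbed into q = -q_t):
\[
\nabla \cdot (\lambda \nabla p) = q
\]
```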

Solver is the most computationally expensive component

Unknowns differ in nature: pressure (elliptic) vs. saturation (hyperbolic)

Multistage preconditioning scheme: Constrained Pressure Residual (CPR)*

CPR with Multigrid as the first stage: a very robust and widely used scheme

* Wallis, J.R., et al. SPE 13536 (1985)

3

Aramco’s GigaPOWERS

Objective

Design and implement a massively parallel reservoir simulation Multigrid solver on GPU architectures

Plan

1. Implement an optimized serial version of Multigrid to have a reasonable serial performance baseline

2. Design and implement a parallel version of Multigrid that harnesses the power of massively parallel GPU architectures

4

Introduction

Background

2D Black Box Geometric MG (GMG)

3D Semicoarsening Multigrid

Future Work

5

Discretized equation is

$A_f x_f = b_f$

Basic 2-Level Multigrid Algorithm (3 steps)

1. The Pre-smoothing Step

$x_f \leftarrow \mathrm{smooth}(A_f, b_f, x_0, \nu_1)$

2. The Coarse-Grid Correction Step

$r_f = b_f - A_f x_f$

$r_c = I_f^c r_f$

$e_c = A_c^{-1} r_c$

$e_f = I_c^f e_c$

$x_f = x_f + e_f$

3. The Post-smoothing Step

$x_f \leftarrow \mathrm{smooth}(A_f, b_f, x_f, \nu_2)$

* Brandt, A. (1977)

[Figure: two-grid cycle: presmoothing, solve the problem on the coarse grid, postsmoothing]
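The three steps above can be sketched for a 1D model problem. This is a minimal illustration, not the solver from the talk: it assumes the operator $-u''$ with zero Dirichlet boundaries, Gauss-Seidel as the smoother, full weighting for $I_f^c$, linear interpolation for $I_c^f$, and an exact tridiagonal coarse solve. All function names are hypothetical.

```python
def apply_A(x, h):
    """Return A_f x for the 1D operator -u'' with zero Dirichlet boundaries."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = (2.0 * x[i] - left - right) / (h * h)
    return y


def smooth(x, b, h, sweeps):
    """Lexicographic Gauss-Seidel sweeps (assumed stand-in for the smoother)."""
    x = list(x)
    n = len(x)
    for _ in range(sweeps):
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            x[i] = 0.5 * (h * h * b[i] + left + right)
    return x


def restrict(r):
    """Full weighting I_f^c: 2m+1 fine points -> m coarse points."""
    m = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(m)]


def prolong(e, n):
    """Linear interpolation I_c^f: m coarse points -> n = 2m+1 fine points."""
    ef = [0.0] * n
    for j, v in enumerate(e):
        ef[2 * j] += 0.5 * v
        ef[2 * j + 1] += v
        ef[2 * j + 2] += 0.5 * v
    return ef


def solve_coarse(r, h):
    """Exact coarse solve e_c = A_c^{-1} r_c via the Thomas algorithm."""
    m = len(r)
    diag, off = 2.0 / (h * h), -1.0 / (h * h)
    cp, dp = [0.0] * m, [0.0] * m
    cp[0], dp[0] = off / diag, r[0] / diag
    for i in range(1, m):
        denom = diag - off * cp[i - 1]
        cp[i] = off / denom
        dp[i] = (r[i] - off * dp[i - 1]) / denom
    e = [0.0] * m
    e[-1] = dp[-1]
    for i in range(m - 2, -1, -1):
        e[i] = dp[i] - cp[i] * e[i + 1]
    return e


def two_grid(x0, b, h, nu1=1, nu2=1):
    """One two-level cycle: pre-smooth, coarse-grid correction, post-smooth."""
    x = smooth(x0, b, h, nu1)                                # step 1
    r = [bi - ai for bi, ai in zip(b, apply_A(x, h))]        # r_f
    ec = solve_coarse(restrict(r), 2.0 * h)                  # r_c, then e_c
    x = [xi + ei for xi, ei in zip(x, prolong(ec, len(x)))]  # x_f += e_f
    return smooth(x, b, h, nu2)                              # step 3


if __name__ == "__main__":
    n = 31
    h = 1.0 / (n + 1)
    x = [0.0] * n
    for _ in range(10):
        x = two_grid(x, [1.0] * n, h)
```

The number of fine points must be odd (n = 2m + 1) so that every coarse point coincides with a fine point. A full multigrid cycle would replace `solve_coarse` with a recursive call.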

6

[Figure: coarse-grid points (I,J) with neighbors (I±1,J±1), fine-grid points (i,j) with neighbors (i±1,j±1), and the nine-point stencil coefficients $T_{i,j}^{nw}$, $T_{i,j}^{n}$, $T_{i,j}^{ne}$, $T_{i,j}^{w}$, $T_{i,j}^{e}$, $T_{i,j}^{sw}$, $T_{i,j}^{s}$, $T_{i,j}^{se}$]

The weights of the prolongation and restriction operators depend on the discontinuous coefficients of the PDE

$\nabla \cdot (\lambda \nabla p) = q$

7

* Alcouffe, R.E., et al. (1981)
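A 1D analogue shows how such coefficient-dependent weights can arise. This is a sketch under assumptions, not the 2D nine-point formula from the slides: it assumes cell-centered coefficients $\lambda$, harmonic averaging for the interface transmissibilities, and interpolation that zeroes the homogeneous residual at the fine point. The function names are hypothetical.

```python
def harmonic(a, b):
    """Harmonic average of two cell coefficients (interface transmissibility)."""
    return 2.0 * a * b / (a + b)


def interpolation_weights(lam_left, lam_center, lam_right):
    """Weights (w_west, w_east) for a fine point between two coarse points.

    Derived from T_w (e_i - e_west) + T_e (e_i - e_east) = 0, so that
    e_i = w_west * e_west + w_east * e_east with w_west + w_east = 1.
    """
    t_w = harmonic(lam_left, lam_center)
    t_e = harmonic(lam_center, lam_right)
    return t_w / (t_w + t_e), t_e / (t_w + t_e)


if __name__ == "__main__":
    # Uniform coefficients reduce to plain linear interpolation.
    print(interpolation_weights(1.0, 1.0, 1.0))
    # A nearly impermeable cell on the west side shifts the weight east.
    print(interpolation_weights(1e-6, 1.0, 1.0))
```

With a strong jump in $\lambda$, the interpolated value follows the neighbor it is well connected to, which is the property that plain linear interpolation lacks.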

Coarse grid operator:

Manual: explicit handling of the PDE on each coarser level

Automatic (Black Box Multigrid)

▪ Using grid transfer operators: $A_c = I_f^c A_f I_c^f = (I_c^f)^T A_f I_c^f$

▪ No info. about coarser grid is needed

▪ Used in Algebraic multigrid

▪ Preserves operator symmetry
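The triple product can be sketched with small dense matrices. This is an illustration only, assuming a 1D operator tridiag(-1, 2, -1) and linear interpolation; a real implementation would use sparse stencils. All names are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def fine_operator(n):
    """Fine-grid operator A_f = tridiag(-1, 2, -1) of size n x n."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A


def interpolation(m):
    """Linear interpolation I_c^f from m coarse to n = 2m+1 fine points."""
    n = 2 * m + 1
    P = [[0.0] * m for _ in range(n)]
    for j in range(m):
        P[2 * j][j] = 0.5
        P[2 * j + 1][j] = 1.0
        P[2 * j + 2][j] = 0.5
    return P


def galerkin(A_f, P):
    """Coarse operator A_c = P^T A_f P, using restriction = P^T."""
    return matmul(transpose(P), matmul(A_f, P))


if __name__ == "__main__":
    A_c = galerkin(fine_operator(7), interpolation(3))
    for row in A_c:
        print(row)
```

Because the restriction is taken as the transpose of the interpolation, a symmetric $A_f$ yields a symmetric $A_c$, and no geometric information about the coarse grid is required.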

In Black Box Multigrid, two stages:

Setup Stage:

▪ The interpolation, restriction and coarse grid operators are calculated.

Solution Stage:

▪ Carrying out the cycling process

Anisotropic PDE coefficients:

Line relaxation (2D), plane relaxation (3D)

Semicoarsening

* Dendy, J.E. (1982), (1986); Schaffer, S. (1998)

8

To handle anisotropies in all three dimensions (x, y, z):

Alternating plane relaxation (too expensive)

Semicoarsening with plane relaxation (cheaper)

▪ One plane solve, and semicoarsening in the dimension orthogonal to that plane.

When the semicoarsening approach is used with exact grid transfer operators, MG becomes a direct solver (i.e. a Schur complement). However, the exact grid transfer operators are not sparse.

A more efficient way is to “approximate” the exact grid transfer operators using a sparse (block diagonal) operator. 2D MG is used to define the components of the operator between every two planes (details can be found in *).

* Schaffer, S. (1998)
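The Schur complement statement can be made concrete with the standard block elimination below. The partition into fine-only planes (F) and coarse planes (C), and the symbols $A_{FF}$, $A_{FC}$, $A_{CF}$, $A_{CC}$, $P$, $R$ and $S$, are notation assumed here for illustration and are not taken from the slides.

```latex
% Unknowns ordered as fine-only planes (F) followed by coarse planes (C):
\[
A_f =
\begin{pmatrix}
A_{FF} & A_{FC} \\
A_{CF} & A_{CC}
\end{pmatrix},
\qquad
P = I_c^f =
\begin{pmatrix}
-A_{FF}^{-1} A_{FC} \\
I
\end{pmatrix},
\qquad
R = I_f^c =
\begin{pmatrix}
-A_{CF} A_{FF}^{-1} & I
\end{pmatrix}
\]
% The Galerkin coarse operator is then exactly the Schur complement:
\[
A_c = R \, A_f \, P = A_{CC} - A_{CF} A_{FF}^{-1} A_{FC} = S
\]
```

Each diagonal block of $A_{FF}$ is a 2D plane operator whose inverse is dense in general, which is why $P$ and $R$ are not sparse and why an approximation is used instead.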

9

Introduction

Background

2D Black Box Geometric MG (GMG)

3D Semicoarsening Multigrid

Future Work

10

Need a Multigrid solver capable of handling highly heterogeneous and anisotropic structured 2D reservoirs, thus:

2D Black Box Multigrid, with

Alternating line relaxation

Testing the solver’s convergence behavior:

Test the convergence ratio for the same problem with varying sizes (using grid refinement)

Compare the performance with well-established and widely used Multigrid solvers, e.g.

▪ SAMG: Algebraic Multigrid solver from the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)

▪ MGD9V, etc.

Test Models

▪ Geostatistically generated using the Stanford Geostatistical Modeling Software (SGeMS)

▪ Derived from the SPE10 Comparative Solution Project model

▪ Large permeability variations of 8-12 orders of magnitude

11

12

SPE10, Layer 70

$\text{Residual Reduction Factor} = \dfrac{\lVert r_{k+1} \rVert_2}{\lVert r_k \rVert_2}$

13

[Figure panels: ¼ Million Cells, 1 Million Cells]

Computational Time Comparison (SPE10 Layer 85 Refined to 1 Million Cells):

• GMG: ~4.5 sec

• SAMG: ~7.0 sec

Parallelization of every component of the algorithm

Both setup stage and solution stage

Does not sacrifice algorithmic scalability (convergence rate)

Smoother

Alternating zebra-line relaxation

Effectively handles anisotropies

Coarsest Solve

4-color GS relaxation (to handle 9-point stencils)
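One common way to build a 4-color ordering for a nine-point stencil is to color each cell by the parity of its two indices. The sketch below assumes that scheme, which the slides do not spell out, and uses hypothetical function names.

```python
def color(i, j):
    """Color in {0, 1, 2, 3} from the parities of the cell indices."""
    return (i % 2) + 2 * (j % 2)


def cells_by_color(nx, ny):
    """Group the cells of an nx-by-ny grid into four independent sets."""
    groups = [[] for _ in range(4)]
    for j in range(ny):
        for i in range(nx):
            groups[color(i, j)].append((i, j))
    return groups


def is_valid_coloring(nx, ny):
    """Check that no cell shares a color with any nine-point neighbor."""
    for j in range(ny):
        for i in range(nx):
            for dj in (-1, 0, 1):
                for di in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    if 0 <= ni < nx and 0 <= nj < ny:
                        if color(ni, nj) == color(i, j):
                            return False
    return True


if __name__ == "__main__":
    print(is_valid_coloring(8, 6))
    print([len(g) for g in cells_by_color(8, 6)])
```

All cells of one color can be relaxed in parallel, since any neighbor in the nine-point stencil differs in at least one index parity and therefore has a different color.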

14

Shared-Memory Parallelization: OpenMP

Coarse threads, hence coarse-scale parallelization

Multiple cells (multiple lines) per thread

Sparse Matrix Format

CSR for cache coherence

Tridiagonal Solver: Thomas Algorithm

Serial within each line (i.e. thread), but several lines are handled in parallel (zebra coloring)

Architecture: 12 Intel® Xeon® X5650 2.66 GHz cores with 48 GB Memory
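The Thomas algorithm and the zebra grouping of lines can be sketched as follows. This is a serial Python illustration with hypothetical names; in the OpenMP version described above, the lines of one color would be spread across threads.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    """
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def zebra_lines(n_lines):
    """Split line indices into two colors; lines of one color are independent."""
    even = [k for k in range(n_lines) if k % 2 == 0]
    odd = [k for k in range(n_lines) if k % 2 == 1]
    return even, odd


if __name__ == "__main__":
    x = thomas([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0],
               [1.0, 0.0, 1.0])
    print(x)
    print(zebra_lines(5))
```

Each solve is inherently sequential along the line, so the parallelism comes only from handling many same-colored lines at once.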

15

Fine Threads

Fine-scale parallelization

Single cell per thread

Sparse Matrix Format

Diagonal with column major ordering

Ideal for structured problems

▪ Coalesces memory accesses

▪ Minimizing storage requirements

▪ Exploits the banded matrix structure for efficient data access

Minimize expensive communication with host

Fit the whole problem on the GPU (up to 16M cells in double precision)
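A minimal sketch of a diagonal (DIA) storage layout and its matrix-vector product is given below. It assumes one flat array holding each diagonal contiguously, so that entry `i` of diagonal `d` sits at `d * n + i`; the exact layout used in the talk is not shown on the slides, and the names are hypothetical.

```python
def tridiag_dia(n):
    """Store tridiag(-1, 2, -1) in DIA form: offsets plus one flat array."""
    offsets = [-1, 0, 1]
    data = [0.0] * (len(offsets) * n)
    for i in range(n):
        if i > 0:
            data[0 * n + i] = -1.0      # sub-diagonal entry A[i][i-1]
        data[1 * n + i] = 2.0           # main diagonal entry A[i][i]
        if i < n - 1:
            data[2 * n + i] = -1.0      # super-diagonal entry A[i][i+1]
    return offsets, data


def dia_matvec(offsets, data, n, x):
    """Compute y = A x; row i plays the role of one GPU thread."""
    y = [0.0] * n
    for d, off in enumerate(offsets):
        for i in range(n):
            col = i + off
            if 0 <= col < n:
                # Consecutive i touch consecutive addresses d * n + i.
                y[i] += data[d * n + i] * x[col]
    return y


if __name__ == "__main__":
    offsets, data = tridiag_dia(4)
    print(dia_matvec(offsets, data, 4, [1.0, 2.0, 3.0, 4.0]))
```

No column indices are stored, and for a fixed diagonal neighboring rows read neighboring memory locations, which is the access pattern that coalesces on a GPU. For a 2D five-point stencil on a grid of width `nx`, the offsets would be `[-nx, -1, 0, 1, nx]`.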

16

Tridiagonal Solver

Parallel cyclic reduction (PCR) in batch* to exploit:

▪ fine-scale parallelism within the line

▪ coarse-scale parallelism exposed by the zebra ordering of lines

Threads operate in two stages:

▪ Preparation Stage

▪ Solution Stage

For coalescing memory accesses during the Preparation Stage (NOTE: grid points are numbered along the x-direction):

▪ In X-line Relaxation: Each x-line is assigned to a block of threads

▪ In Y-line Relaxation: Points with the same x-coordinate are assigned to a block of threads

* Using the NVIDIA cuSPARSE library (https://developer.nvidia.com/cusparse)

17
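The idea behind parallel cyclic reduction can be sketched in plain Python. This is a serial emulation with hypothetical names, not the cuSPARSE routine used in the talk: the inner loop over `i` is what would run as one thread per unknown, and the outer loop over lines is the batch.

```python
def pcr(a, b, c, d):
    """Solve one tridiagonal system by parallel cyclic reduction.

    a: sub-diagonal (a[0] must be 0), b: diagonal,
    c: super-diagonal (c[-1] must be 0), d: right-hand side.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        for i in range(n):  # independent updates: one GPU thread each
            lo, hi = i - stride, i + stride
            if lo >= 0:     # eliminate the coupling to x[lo]
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:      # eliminate the coupling to x[hi]
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # Every equation is now decoupled: b[i] * x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]


def pcr_batch(systems):
    """Solve many independent lines; on a GPU these would run concurrently."""
    return [pcr(a, b, c, d) for (a, b, c, d) in systems]


if __name__ == "__main__":
    line = ([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0],
            [1.0, 0.0, 1.0])
    print(pcr_batch([line, line]))
```

PCR needs about log2(n) reduction steps with n independent updates each, so it does more arithmetic than the Thomas algorithm but exposes parallelism within a single line.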

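[Figure: 5 x 5 grid with points numbered along the x-direction]

y = 5: 21 22 23 24 25

y = 4: 16 17 18 19 20

y = 3: 11 12 13 14 15

y = 2: 6 7 8 9 10

y = 1: 1 2 3 4 5

x = 1 2 3 4 5

Accesses along x are coalesced; accesses along y are non-coalesced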
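The numbering in the figure can be reproduced with a short sketch that lists the point numbers touched along an x-line and along a y-line. It uses the 1-based numbering of the 5 x 5 example; the function names are hypothetical.

```python
def point_number(x, y, nx):
    """1-based point number when points are numbered along the x-direction."""
    return (y - 1) * nx + x


def x_line(y, nx):
    """Point numbers along the x-line at height y (stride 1)."""
    return [point_number(x, y, nx) for x in range(1, nx + 1)]


def y_line(x, nx, ny):
    """Point numbers along the y-line at position x (stride nx)."""
    return [point_number(x, y, nx) for y in range(1, ny + 1)]


if __name__ == "__main__":
    print(x_line(1, 5))
    print(y_line(1, 5, 5))
```

Neighboring threads that walk an x-line touch consecutive numbers, whereas threads that walk a y-line are `nx` entries apart, which is why the thread-to-point assignment matters for coalescing.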

Criteria | Multicore | GPU

Architecture Specs | 12 Intel® Xeon® X5650 2.66 GHz cores with 48 GB Memory | Nvidia Fermi-based C2070 with 448 CUDA cores and 6 GB Memory

Matrix Structure | CSR format (for cache coherence) | Diagonal format with column-major ordering (for coalescing memory accesses)

Parallelization API | OpenMP | CUDA

Parallelization Granularity | Multiple cells per thread (coarse) | One cell per thread (fine)

Tridiagonal Solver Algorithm (for line relaxation) | Thomas Algorithm (serial within each line, but multiple lines are handled in parallel by zebra coloring) | Parallel Cyclic Reduction in batch (parallel within each line, and multiple lines are handled in parallel as well)

18

Homogeneous Permeability Case:

Solved with just one V(0,1) cycle

▪ Residual reduction by $10^9$

Focuses on the scalability of the setup stage

Heterogeneous Permeability Case:

Derived from the SPE10 85th layer by grid refinement

Solved with six V(0,1) cycles

▪ Residual reduction by $10^9$

Focuses on the scalability of the solution stage

Problem Sizes: 1 Million, 4 Million and 16 Million cells

19

Introduction

Background

2D Black Box Geometric MG (GMG)

3D Semicoarsening Multigrid

Future Work

24

In reservoir simulation, the z-direction has:

Huge variations due to natural deposition

Severe anisotropy compared to the x/y directions

▪ An effect of discretization (pancake models)

Semicoarsening in the z-direction, and plane relaxation in the x-y plane

We can use 2D MG for both:

Setup Stage: construction of grid transfer operators

Solution Stage: x-y plane relaxation

25

Parallelize the plane-solve kernel in both:

Setup Stage: construction of grid transfer operators

▪ Five V(0,1) cycles/plane for approximating an “exact solve”

Solution Stage: red/black plane relaxation

▪ One V(0,1) cycle/plane for doing plane relaxation

Note that those 2D V(0,1) cycles are already parallelized (using the 2D GMG algorithm explained earlier)

Other kernels are amenable to parallelization on the GPU, but have not been tackled yet (in progress).
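The red/black ordering of plane solves can be sketched as a simple scheduling step. The round-robin assignment of planes to workers (cores or GPUs) is an assumption made for illustration, as are the function names; the slides do not state how planes are distributed.

```python
def red_black_planes(nz):
    """Split plane indices along z into red (even) and black (odd) sets."""
    red = [k for k in range(nz) if k % 2 == 0]
    black = [k for k in range(nz) if k % 2 == 1]
    return red, black


def assign_round_robin(planes, n_workers):
    """Assumed distribution: deal same-colored planes out to the workers."""
    return [planes[w::n_workers] for w in range(n_workers)]


def relaxation_schedule(nz, n_workers):
    """Two phases per sweep: all red planes first, then all black planes."""
    red, black = red_black_planes(nz)
    return [assign_round_robin(red, n_workers),
            assign_round_robin(black, n_workers)]


if __name__ == "__main__":
    for phase in relaxation_schedule(17, 3):
        print(phase)
```

Planes of the same color do not neighbor each other in z, so their 2D solves within one phase are independent and can run concurrently.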

26

[Figure: 2D MG for Plane Solve; planes stacked along z]

Implementation:

▪ CPU: Use OpenMP threads to distribute the plane solves across multiple cores

▪ GPU: Use CUDA with OpenMP to distribute the plane solves to multiple GPUs

Platform:

▪ CPU: 24 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz with HT and 180 GB Memory

▪ GPU: 6 Nvidia Fermi-based M2090s

Test cases:

Homogeneous and heterogeneous (SPE 10) with various sizes

Results:

Average time for the plane solves for both setup and solution stages

27

28

[Figure: Speed Up (vertical axis, 0 to 24) for 24 cores, 1 GPU, 2 GPUs, 3 GPUs, 4 GPUs, 5 GPUs and 6 GPUs; series: 16K x 129 ~ 2M cells, 66K x 33 ~ 2M cells, 1M x 17 ~ 18M cells, 4M x 17 ~ 71M cells]

29

[Figure: Speed Up (vertical axis, 0 to 20) for 24 cores, 1 GPU, 2 GPUs, 3 GPUs, 4 GPUs, 5 GPUs and 6 GPUs; series: 16K x 129 ~ 2M cells, 66K x 33 ~ 2M cells, 1M x 17 ~ 18M cells, 4M x 17 ~ 71M cells]

Planes need to be sufficiently large (> 1M cells) for a noticeable advantage

This is good for reservoir simulation, as grid refinement studies are usually made by refining the horizontal planes.

Beyond 2-3 GPUs, no performance is gained

Could be due to the number of planes or the plane size

Needs more investigation and profiling

30

Accelerate other kernels of 3D Semicoarsening Multigrid using GPUs (such as coarse operator construction, etc.)

Algebraic Multiscale Solver on GPUs is next!

31

Thank you for listening

Questions

32
