phd student stanford...
TRANSCRIPT
![Page 1: PhD Student Stanford Universityon-demand.gputechconf.com/gtc/2013/presentations/S3301-Massively... · PhD Student Hamdi Tchelepi ... Aramco’s GigaPOWERS ... This is good for reservoir](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a9e47457f8b9a21488e26db/html5/thumbnails/1.jpg)
Abdulrahman Manea
PhD Student
Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science
Energy Resources Engineering Department
School of Earth Sciences Stanford University
Introduction
Background
2D Black Box Geometric MG (GMG)
3D Semicoarsening Multigrid
Future Work
Reservoir Simulation (Black Oil):
Mass Conservation of Component α:
Incompressible:
Total Balance:
Incompressible Pressure Equation:
Solver is the most computationally expensive component
Unknowns differ in nature: pressure (elliptic) vs. saturation (hyperbolic)
Multistage preconditioning scheme: Constrained Pressure Residual (CPR)*
CPR with Multigrid as the first stage: a very robust and widely used scheme
* Wallis, J.R., et al. SPE 13536 (1985)
Aramco’s GigaPOWERS
Objective
Design and implement a massively parallel Multigrid solver for reservoir simulation on GPU architectures
Plan
1. Implement an optimized serial version of Multigrid to have a reasonable serial performance baseline
2. Design and implement a parallel version of Multigrid that harnesses the power of massively parallel GPU architectures
The discretized equation is
$A_f x_f = b_f$
Basic 2-Level Multigrid Algorithm (3 steps)
1. The Pre-smoothing Step:
$x_f \leftarrow \mathrm{smooth}(A_f, b_f, x_0, \nu_1)$
2. The Coarse-Grid Correction Step:
$r_f = b_f - A_f x_f$
$r_c = I_f^c r_f$
$e_c = A_c^{-1} r_c$
$e_f = I_c^f e_c$
$x_f = x_f + e_f$
3. The Post-smoothing Step:
$x_f \leftarrow \mathrm{smooth}(A_f, b_f, x_f, \nu_2)$
* Brandt, A. (1977)
[Figure: two-grid cycle: pre-smoothing, solve the problem on the coarse grid, post-smoothing]
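The three steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the solver from the talk: the model problem is a 1D Poisson equation, the smoother is weighted Jacobi, the prolongation `P` is linear interpolation, the restriction is `R = 0.5 * P.T`, and the coarse operator is formed as `R @ A_f @ P`. All function names are hypothetical.

```python
import numpy as np

def smooth(A, b, x, nu, omega=2.0 / 3.0):
    """Apply nu sweeps of weighted Jacobi (an assumed smoother choice)."""
    d = np.diag(A)
    for _ in range(nu):
        x = x + omega * (b - A @ x) / d
    return x

def two_grid_cycle(A_f, b_f, x_f, P, R, nu1=1, nu2=1):
    """One two-level cycle: pre-smooth, coarse-grid correction, post-smooth."""
    A_c = R @ A_f @ P                     # coarse operator (assumed Galerkin product)
    x_f = smooth(A_f, b_f, x_f, nu1)      # 1. pre-smoothing
    r_f = b_f - A_f @ x_f                 # 2. fine-grid residual
    r_c = R @ r_f                         #    restrict the residual
    e_c = np.linalg.solve(A_c, r_c)       #    solve exactly on the coarse grid
    x_f = x_f + P @ e_c                   #    prolong the error and correct
    return smooth(A_f, b_f, x_f, nu2)     # 3. post-smoothing

def poisson_1d(n):
    """Dense 1D Poisson matrix on n interior points of the unit interval."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def linear_interpolation(n_c):
    """Prolongation from n_c coarse points to 2 * n_c + 1 fine points."""
    P = np.zeros((2 * n_c + 1, n_c))
    for j in range(n_c):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P

def run_demo(n_c=15, cycles=10):
    """Return the residual 2-norm before the first cycle and after each cycle."""
    P = linear_interpolation(n_c)
    R = 0.5 * P.T
    A_f = poisson_1d(P.shape[0])
    b_f = np.ones(P.shape[0])
    x_f = np.zeros_like(b_f)
    norms = [np.linalg.norm(b_f - A_f @ x_f)]
    for _ in range(cycles):
        x_f = two_grid_cycle(A_f, b_f, x_f, P, R)
        norms.append(np.linalg.norm(b_f - A_f @ x_f))
    return norms
```

Calling `run_demo()` returns the residual history, which should drop quickly from cycle to cycle for this model problem.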
[Figure: coarse-grid cell (I,J) with its eight neighbors, overlaid on the fine-grid cells (i,j) and their neighbors; stencil coefficients $T_{i,j}^{nw}$, $T_{i,j}^{n}$, $T_{i,j}^{ne}$, $T_{i,j}^{w}$, $T_{i,j}^{e}$, $T_{i,j}^{sw}$, $T_{i,j}^{s}$, $T_{i,j}^{se}$]
The weights of the prolongation and restriction operators depend on the discontinuous coefficients of the PDE*
$\nabla \cdot (\lambda \nabla p) = q$
* Alcouffe, R.E., et al. (1981)
Coarse grid operator:
▪ Manual: explicit handling of the PDE on each coarser level
▪ Automatic (Black Box Multigrid), using the grid transfer operators:
$A_c = I_f^c A_f I_c^f = (I_c^f)^T A_f I_c^f$
  ▪ No information about the coarser grid is needed
  ▪ Used in Algebraic Multigrid
  ▪ Preserves operator symmetry
In Black Box Multigrid, there are two stages:
▪ Setup Stage: the interpolation, restriction and coarse grid operators are calculated
▪ Solution Stage: the cycling process is carried out
Anisotropic PDE coefficients are handled by:
▪ Line relaxation (2D) or plane relaxation (3D)
▪ Semicoarsening
* Dendy, J.E. (1982), (1986); Schaffer, S. (1998)
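The symmetry claim for the automatically built coarse operator can be checked numerically. The sketch below is illustrative only: it assumes a 1D heterogeneous operator built from hypothetical face transmissibilities `T`, and it uses plain linear interpolation as a stand-in for the operator-dependent weights that Black Box Multigrid would compute.

```python
import numpy as np

def heterogeneous_operator_1d(T):
    """Symmetric tridiagonal operator from n + 1 face transmissibilities T."""
    n = len(T) - 1
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = T[i] + T[i + 1]
        if i + 1 < n:
            A[i, i + 1] = A[i + 1, i] = -T[i + 1]
    return A

def linear_interpolation(n_c):
    """Prolongation from n_c coarse points to 2 * n_c + 1 fine points."""
    P = np.zeros((2 * n_c + 1, n_c))
    for j in range(n_c):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P

def galerkin_coarse_operator(A_f, P):
    """Coarse operator A_c = P^T A_f P, with restriction taken as P^T."""
    return P.T @ A_f @ P

rng = np.random.default_rng(0)
P = linear_interpolation(7)
T = 10.0 ** rng.uniform(-2.0, 2.0, size=P.shape[0] + 1)  # made-up coefficients
A_f = heterogeneous_operator_1d(T)
A_c = galerkin_coarse_operator(A_f, P)
```

Only `A_f` and `P` enter the product, so no separate discretization on the coarse grid is needed, and `A_c` inherits the symmetry of `A_f`.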
To handle anisotropies in all three dimensions (x, y, z):
▪ Alternating plane relaxation (too expensive)
▪ Semicoarsening with plane relaxation (cheaper): one plane solve, and semicoarsening in the dimension orthogonal to that plane
When the semicoarsening approach is used with exact grid transfer operators, MG becomes a direct solver (i.e. a Schur complement). However, the exact grid transfer operators are not sparse. [Equation not recovered]
A more efficient way is to approximate the exact grid transfer operators with a sparse (block diagonal) operator. 2D MG is used to define the components of the operator between every two planes (details can be found in *).
* Schaffer, S. (1998)
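The direct-solver statement can be verified on a small dense example. The sketch below is an assumption-laden illustration, not the talk's implementation: the unknowns are split into a fine-only set `F` (standing in for the planes removed by semicoarsening) and a coarse set `C`, the exact prolongation is taken as `P = [-A_FF^{-1} A_FC; I]`, and the restriction is `P.T`. One exact solve on `F` followed by one coarse-grid correction then reproduces the true solution, and the coarse operator equals the Schur complement.

```python
import numpy as np

def exact_two_level_solve(A, b, F, C):
    """Two-level solve with exact grid transfer operators (symmetric A assumed)."""
    A_FF = A[np.ix_(F, F)]
    A_FC = A[np.ix_(F, C)]
    n = A.shape[0]
    # Exact prolongation: identity on C, -A_FF^{-1} A_FC on F (dense in general).
    P = np.zeros((n, len(C)))
    P[F, :] = -np.linalg.solve(A_FF, A_FC)
    P[C, :] = np.eye(len(C))
    A_c = P.T @ A @ P                       # equals the Schur complement
    x = np.zeros(n)
    x[F] = np.linalg.solve(A_FF, b[F])      # exact relaxation on F (x_C = 0)
    r = b - A @ x                           # residual is now zero on F
    x = x + P @ np.linalg.solve(A_c, P.T @ r)
    return x, A_c

rng = np.random.default_rng(1)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # made-up symmetric positive definite matrix
b = rng.standard_normal(n)
F = np.arange(1, n, 2)                      # "odd planes" (removed)
C = np.arange(0, n, 2)                      # "even planes" (kept)
x, A_c = exact_two_level_solve(A, b, F, C)
schur = A[np.ix_(C, C)] - A[np.ix_(C, F)] @ np.linalg.solve(A[np.ix_(F, F)], A[np.ix_(F, C)])
```

The block `-A_FF^{-1} A_FC` is dense in general, which is why a sparse approximation of the transfer operators is attractive.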
Need a Multigrid solver capable of handling highly heterogeneous and anisotropic structured 2D reservoirs, thus:
▪ 2D Black Box Multigrid, with
▪ Alternating line relaxation
Testing the solver's convergence behavior:
▪ Test the convergence ratio for the same problem with varying sizes (using grid refinement)
▪ Compare the performance with well-established and widely used Multigrid solvers, e.g.
  ▪ SAMG: Algebraic Multigrid solver from the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)
  ▪ MGD9V, etc.
Test models:
▪ Geostatistically generated using the Stanford Geostatistical Modeling Software (SGeMS)
▪ Derived from the SPE10 Comparative Solution Project model
▪ Large permeability variations of 8 to 12 orders of magnitude
SPE10, Layer 70
Residual Reduction Factor $= \dfrac{\|r_{k+1}\|_2}{\|r_k\|_2}$
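As a small illustration of this metric, the helper below turns a history of residual 2-norms into per-cycle reduction factors. The function name and the sample numbers are made up for the example; they are not results from the talk.

```python
import numpy as np

def residual_reduction_factors(residual_norms):
    """Return ||r_{k+1}||_2 / ||r_k||_2 for consecutive entries of the history."""
    r = np.asarray(residual_norms, dtype=float)
    return r[1:] / r[:-1]

# Hypothetical residual norms after cycles 0, 1 and 2.
factors = residual_reduction_factors([1.0, 0.1, 0.02])
```

A factor that stays roughly constant as the grid is refined indicates that the convergence rate does not degrade with problem size.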
[Figure panels: ¼ Million Cells; 1 Million Cells]
Computational Time Comparison (SPE10 Layer 85 Refined to 1 Million Cells):
• GMG: ~4.5 sec
• SAMG: ~7.0 sec
Parallelization of every component of the algorithm
Both setup stage and solution stage
Does not sacrifice algorithmic scalability (convergence rate)
Smoother
Alternating zebra-line relaxation
Effectively handles anisotropies
Coarsest Solve
4-color GS relaxation (to handle 9-point stencils)
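Why four colors are enough for a 9-point stencil can be checked with a short script. The coloring rule below (parity of `i` plus twice the parity of `j`) is one common choice and is an assumption here; the talk does not state which rule was used.

```python
def four_color(i, j):
    """Color in {0, 1, 2, 3} from the parities of the cell indices."""
    return (i % 2) + 2 * (j % 2)

def colors_are_independent(nx, ny):
    """True if no cell shares a color with any of its 8 stencil neighbors."""
    for j in range(ny):
        for i in range(nx):
            for dj in (-1, 0, 1):
                for di in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    if 0 <= ni < nx and 0 <= nj < ny:
                        if four_color(ni, nj) == four_color(i, j):
                            return False
    return True
```

Cells of the same color never appear in each other's stencil, so all cells of one color can be relaxed in parallel, with four sequential color passes per Gauss-Seidel sweep.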
Shared-memory parallelization: OpenMP
Coarse threads, hence coarse-scale parallelization
▪ Multiple cells (multiple lines) per thread
Sparse matrix format
▪ CSR for cache coherence
Tridiagonal solver: Thomas algorithm
▪ Serial within each line (i.e. thread), but several lines are handled in parallel (zebra coloring)
Architecture: 12 Intel® Xeon® X5650 2.66 GHz cores with 48 GB memory
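For reference, the Thomas algorithm used per line on the multicore side can be sketched as follows. This is a plain NumPy version written for illustration (the talk's implementation is OpenMP-based and is not shown in the slides); the argument convention, with `a` as sub-diagonal, `b` as diagonal and `c` as super-diagonal, is an assumption.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system; a[0] and c[-1] are ignored."""
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                    # forward elimination (sequential)
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution (sequential)
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both loops carry a dependence from one unknown to the next, so the parallelism comes from solving many independent lines of the same zebra color at once, one or more lines per thread.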
Fine Threads
Fine-scale parallelization
Single cell per thread
Sparse Matrix Format
Diagonal with column major ordering
Ideal for structured problems
▪ Coalesces memory accesses
▪ Minimizing storage requirements
▪ Exploits the banded matrix structure for efficient data access
Minimize expensive communication with the host
Fit the whole problem on the GPU (up to 16M cells in double precision)
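The diagonal storage idea can be sketched with a matrix-vector product. The example assumes a 5-point stencil on an `nx` by `ny` grid numbered along x, and stores each diagonal as one contiguous row of `data`, so that consecutive rows of the matrix (one per GPU thread in the talk's setting) read consecutive memory locations. The layout and all names are illustrative assumptions, and the code runs on the CPU with NumPy.

```python
import numpy as np

def laplacian_5pt_dia(nx, ny):
    """Return (offsets, data) with data[k, row] = A[row, row + offsets[k]]."""
    n = nx * ny
    offsets = [-nx, -1, 0, 1, nx]
    data = np.zeros((len(offsets), n))
    for j in range(ny):
        for i in range(nx):
            row = i + nx * j                 # x-fastest numbering
            data[2, row] = 4.0
            if j > 0:
                data[0, row] = -1.0
            if i > 0:
                data[1, row] = -1.0
            if i < nx - 1:
                data[3, row] = -1.0
            if j < ny - 1:
                data[4, row] = -1.0
    return offsets, data

def dia_matvec(offsets, data, x):
    """y = A x using only the stored diagonals; no column indices are needed."""
    n = x.size
    y = np.zeros(n)
    for k, off in enumerate(offsets):
        lo, hi = max(0, -off), min(n, n - off)
        y[lo:hi] += data[k, lo:hi] * x[lo + off:hi + off]
    return y

def dia_to_dense(offsets, data):
    """Expand to a dense matrix (for checking small cases only)."""
    n = data.shape[1]
    A = np.zeros((n, n))
    for k, off in enumerate(offsets):
        for row in range(n):
            col = row + off
            if 0 <= col < n:
                A[row, col] = data[k, row]
    return A
```

Compared with CSR, this layout stores no column indices or row pointers, which is where the storage saving for banded structured-grid matrices comes from.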
Tridiagonal solver: Parallel cyclic reduction (PCR) in batch*, to exploit:
▪ fine-scale parallelism within the line
▪ coarse-scale parallelism exposed by the zebra ordering of lines
Threads operate in two stages:
▪ Preparation Stage
▪ Solution Stage
For coalescing memory accesses during the Preparation Stage (NOTE: grid points are numbered along the x-direction):
▪ In x-line relaxation: each x-line is assigned to a block of threads
▪ In y-line relaxation: points with the same x-coordinate are assigned to a block of threads
* Using the NVIDIA cuSPARSE library (https://developer.nvidia.com/cusparse)
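[Figure: 5 x 5 grid with points numbered along the x-direction; accesses along x are coalesced, accesses along y are non-coalesced]
y
5 21 22 23 24 25
4 16 17 18 19 20
3 11 12 13 14 15
2 6 7 8 9 10
1 1 2 3 4 5
  1 2 3 4 5 x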
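The idea behind batched parallel cyclic reduction can be sketched in NumPy. This is not the cuSPARSE routine used in the talk; it is a hypothetical CPU illustration in which each row of the input arrays is one tridiagonal line, and every array operation acts on all unknowns of all lines at once, mimicking one thread per unknown.

```python
import numpy as np

def _shift(v, s, fill):
    """Return w with w[..., i] = v[..., i - s]; out-of-range entries get fill."""
    w = np.full(v.shape, fill, dtype=float)
    if s > 0:
        w[..., s:] = v[..., :-s]
    else:
        w[..., :s] = v[..., -s:]
    return w

def pcr_solve_batch(a, b, c, d):
    """Solve many tridiagonal systems (last axis = one line) by PCR.

    a, b, c are the sub-, main and super-diagonals; a[..., 0] and
    c[..., -1] are ignored.
    """
    a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
    n = b.shape[-1]
    a[..., 0] = 0.0
    c[..., -1] = 0.0
    s = 1
    while s < n:                              # about log2(n) reduction steps
        alpha = -a / _shift(b, s, 1.0)        # eliminates x[i - s] from row i
        gamma = -c / _shift(b, -s, 1.0)       # eliminates x[i + s] from row i
        b_new = b + alpha * _shift(c, s, 0.0) + gamma * _shift(a, -s, 0.0)
        d_new = d + alpha * _shift(d, s, 0.0) + gamma * _shift(d, -s, 0.0)
        a, c = alpha * _shift(a, s, 0.0), gamma * _shift(c, -s, 0.0)
        b, d = b_new, d_new
        s *= 2                                # rows now couple x[i - s], x[i], x[i + s]
    return d / b                              # every row is decoupled
```

Each reduction step updates every row independently, which exposes the fine-scale parallelism within a line, while the leading axis carries the lines of one zebra color.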
| Criteria | Multicore | GPU |
| --- | --- | --- |
| Architecture specs | 12 Intel® Xeon® X5650 2.66 GHz cores with 48 GB memory | Nvidia Fermi-based C2070 with 448 CUDA cores and 6 GB memory |
| Matrix structure | CSR format for cache coherence | Diagonal format with column-major ordering (for coalescing memory accesses) |
| Parallelization API | OpenMP | CUDA |
| Parallelization granularity | Multiple cells per thread (coarse) | One cell per thread (fine) |
| Tridiagonal solver algorithm (for line relaxation) | Thomas algorithm (serial within each line, but multiple lines are handled in parallel by zebra coloring) | Parallel cyclic reduction in batch (parallel within each line, and multiple lines are handled in parallel as well) |
Homogeneous permeability case:
▪ Solved with just one V(0,1) cycle
▪ Residual reduction by $10^9$
▪ Focuses on the scalability of the setup stage
Heterogeneous permeability case:
▪ Derived from the SPE10 85th layer by grid refinement
▪ Solved with six V(0,1) cycles
▪ Residual reduction by $10^9$
▪ Focuses on the scalability of the solution stage
Problem sizes: 1 million, 4 million and 16 million cells
In reservoir simulation, the z-direction shows:
▪ Huge variations due to natural deposition
▪ Severe anisotropy compared to the x/y directions, an effect of discretization (pancake models)
Semicoarsening in the z-direction, and plane relaxation in the x-y plane
We can use 2D MG for both:
▪ Setup Stage: construction of grid transfer operators
▪ Solution Stage: x-y plane relaxation
Parallelize the plane solve kernel in both:
▪ Setup Stage: construction of grid transfer operators, with five V(0,1) cycles per plane to approximate an "exact solve"
▪ Solution Stage: red/black plane relaxation, with one V(0,1) cycle per plane for plane relaxation
Note that those 2D V(0,1) cycles are already parallelized (using the 2D GMG algorithm explained earlier)
Other kernels are amenable to parallelization on the GPU, but have not been tackled yet (in progress).
[Figure: 2D MG for the plane solve; planes stacked along z]
Implementation:
▪ CPU: use OpenMP threads to distribute the plane solves across multiple cores
▪ GPU: use CUDA with OpenMP to distribute the plane solves to multiple GPUs
Platform:
▪ CPU: 24 Intel(R) Xeon(R) X5660 @ 2.80 GHz cores with HT and 180 GB memory
▪ GPU: 6 Nvidia Fermi-based M2090s
Test cases:
▪ Homogeneous and heterogeneous (SPE10) with various sizes
Results:
▪ Average time for the plane solves for both the setup and solution stages
[Figure: speedup (axis 0 to 24) for 24 cores and for 1 to 6 GPUs; series: 16K x 129 ~ 2M cells, 66K x 33 ~ 2M cells, 1M x 17 ~ 18M cells, 4M x 17 ~ 71M cells]
[Figure: speedup (axis 0 to 20) for 24 cores and for 1 to 6 GPUs; series: 16K x 129 ~ 2M cells, 66K x 33 ~ 2M cells, 1M x 17 ~ 18M cells, 4M x 17 ~ 71M cells]
Planes need to be sufficiently large (> 1M cells) for a noticeable advantage
▪ This is good for reservoir simulation, as grid refinement studies are usually made by refining the horizontal planes
Beyond 2-3 GPUs, no performance is gained
▪ Could be due to the number of planes, or the plane size
▪ Needs more investigation and profiling
Accelerate other kernels of 3D Semicoarsening Multigrid using GPUs (such as coarse operator construction, etc.)
An Algebraic Multiscale Solver on GPUs is next!
Thank you for listening
Questions