large-scale reservoir simulation on gpu - gpu technology...

RESERVOIR SIMULATION

Large-Scale Reservoir Simulation on GPU

Song Yu, Hui Liu

Advisor: Dr. Zhangxing (John) Chen

University of Calgary


Outline

•  Introduction

•  GPU-based Linear Solver

•  GPU-based Reservoir Simulation

•  Numerical Experiments

•  Conclusions


Introduction

•  Numerical methods (FDM, FEM, FVM) → matrix system

•  The system matrix arising from simulation is sparse, highly nonsymmetric, and ill-conditioned.

•  The general choice: Krylov subspace solvers with preconditioners.

•  Large-scale reservoir simulation spends 80%-90% of its time in the solver.

•  Speeding up the linear solver → speeding up the reservoir simulation


GPU Architecture (Tesla)

[Figure: block diagram of the Tesla GPU architecture, showing the host interface, GigaThread, the DRAM interfaces, the L2 cache, and the GPU streaming multiprocessors (SMs).]


GPU-based Linear Solver Package


GMRES

An iterative algorithm for solving linear systems of the form $Ax = b$.

•  For an $n \times n$ matrix, GMRES converges to the exact solution within $n$ iterations (in exact arithmetic).

•  In practice $n$ is very large, so we use restarted GMRES, GMRES(m), which restarts after every $m$ iterations.

•  GMRES converges after a small number of iterations when it is used with a good preconditioner.

Main computational kernels:

•  BLAS operations:
   vector scale: $y = \alpha x$
   dot product: $z = x^T y$
   saxpy: $y = \alpha x + y$

•  Matrix-vector product: $y = Ax$

•  Preconditioning operation: $Mr = b$
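The kernels listed above are enough to build the whole solver. As an illustration only, here is a minimal CPU-side sketch of unpreconditioned restarted GMRES in plain Python. It is not the authors' CUDA implementation; the dense `matvec` stands in for the sparse GPU kernel, and all function names are assumptions made for this example.

```python
import math

def matvec(A, x):
    """y = A x for a dense list-of-rows matrix (stand-in for the sparse kernel)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

def scale(alpha, x):
    return [alpha * a for a in x]

def gmres_restarted(A, b, m=20, tol=1e-10, max_restarts=50):
    """Restarted GMRES(m) with modified Gram-Schmidt and Givens rotations."""
    x = [0.0] * len(b)
    bnorm = math.sqrt(dot(b, b)) or 1.0
    for _ in range(max_restarts):
        r = axpy(-1.0, matvec(A, x), b)          # r = b - A x
        beta = math.sqrt(dot(r, r))
        if beta / bnorm < tol:
            break
        V = [scale(1.0 / beta, r)]               # orthonormal Krylov basis
        H = []                                   # rotated Hessenberg columns
        cs, sn = [], []                          # Givens rotation coefficients
        g = [beta]                               # rotated right-hand side
        for j in range(m):
            w = matvec(A, V[j])
            h = []
            for i in range(j + 1):               # modified Gram-Schmidt
                hij = dot(w, V[i])
                h.append(hij)
                w = axpy(-hij, V[i], w)
            hnext = math.sqrt(dot(w, w))
            for i in range(j):                   # apply earlier rotations
                t = cs[i] * h[i] + sn[i] * h[i + 1]
                h[i + 1] = -sn[i] * h[i] + cs[i] * h[i + 1]
                h[i] = t
            denom = math.hypot(h[j], hnext)      # new rotation removes hnext
            c, s = (1.0, 0.0) if denom == 0.0 else (h[j] / denom, hnext / denom)
            cs.append(c)
            sn.append(s)
            h[j] = c * h[j] + s * hnext
            g.append(-s * g[j])
            g[j] = c * g[j]
            H.append(h)
            if abs(g[j + 1]) / bnorm < tol or hnext == 0.0:
                break
            V.append(scale(1.0 / hnext, w))
        k = len(H)
        y = [0.0] * k
        for i in reversed(range(k)):             # back substitution
            acc = g[i] - sum(H[col][i] * y[col] for col in range(i + 1, k))
            y[i] = acc / H[i][i]
        for i in range(k):                       # x = x + V y
            x = axpy(y[i], V[i], x)
    return x

A = [[4.0, 1.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 11.0, 8.0]                             # b = A * [1, 2, 3]
x = gmres_restarted(A, b, m=2, max_restarts=2000)
```

Apart from the small Hessenberg bookkeeping, every step is a matrix-vector product, a dot product, a scale, or an axpy, which is why accelerating those kernels accelerates the solver.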


Preconditioner

The convergence rate of iterative linear solvers depends strongly on the condition number of the matrix. Preconditioners reduce the condition number and speed up the convergence of iterative solvers.

$Ax = b \;\rightarrow\; M^{-1}Ax = M^{-1}b$, with $M \approx A \approx LU$

Two criteria for choosing $M$:
1. $M$ is a good approximation of $A$.
2. $M^{-1}$ is easy to compute, or $Mx = b$ is easy to solve.

•  ILU is one of the most popular preconditioner families. Some nonzero elements of the L and U factors are dropped to reduce the cost and the number of fill-ins. ILU has many variants, based on the level of fill-in:
1. No fill-in: ILU(0) is the simplest. Its lower and upper triangular factors keep nonzero elements only at positions where the original matrix has nonzero elements.
2. With fill-in: ILUT, which uses a numerical threshold, and ILU(k), which uses fill-in level k.

The more fill-in, the more time the factorization takes. It is a trade-off between accuracy and speed.
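To make the "no fill-in" rule concrete, the following is a small sketch of ILU(0) in plain Python. It stores the matrix densely for readability, which a real implementation would not do, and the function names `ilu0` and `apply_ilu` are assumptions for this example rather than part of the solver package.

```python
def ilu0(A):
    """ILU(0) of a dense-stored matrix: updates outside A's nonzero pattern are dropped.

    Returns one matrix holding L (strict lower part, unit diagonal implied) and U.
    """
    n = len(A)
    pattern = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    LU = [row[:] for row in A]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue                      # position not in the pattern: skip
            LU[i][k] /= LU[k][k]              # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i][j]:             # update only existing nonzeros
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

def apply_ilu(LU, r):
    """Apply the preconditioner: solve L U z = r by two triangular solves."""
    n = len(r)
    y = r[:]
    for i in range(n):                        # forward solve, L has unit diagonal
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    z = y[:]
    for i in reversed(range(n)):              # backward solve with U
        z[i] = (z[i] - sum(LU[i][j] * z[j] for j in range(i + 1, n))) / LU[i][i]
    return z

# A tridiagonal matrix produces no fill-in, so ILU(0) is its exact LU factorization.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
z = apply_ilu(ilu0(A), [2.0, 4.0, 10.0])      # right-hand side is A * [1, 2, 3]
```

For matrices whose exact factors would fill in, the skipped updates are what makes `LU` only an approximation of `A`, and also what keeps its storage equal to that of `A`.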


Sparse matrix vector multiplication

•  Matrix: HEC format, Hybrid of ELL format and CSR format

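[Figure: HEC storage layout. The ELL part holds column indices J and values V; the CSR part holds row pointers Ap, column indices J, and values V, with row i delimited by entries i and i+1 of Ap.]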
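A hybrid ELL/CSR layout can be sketched as follows. This is an illustrative CPU version under stated assumptions: the first `ell_width` entries of each row go into a padded ELL block and any remaining entries go into a CSR overflow. The splitting rule, the function names, and the array names other than J, V, and Ap are assumptions, not the authors' exact format.

```python
def to_hec(rows, ell_width):
    """Split a sparse matrix into an ELL part and a CSR overflow part.

    rows[i] is a list of (column, value) pairs for row i.
    """
    n = len(rows)
    ell_j = [[0] * ell_width for _ in range(n)]     # padded column indices
    ell_v = [[0.0] * ell_width for _ in range(n)]   # padded values (0.0 = padding)
    ap, csr_j, csr_v = [0], [], []                  # CSR row pointers, columns, values
    for i, row in enumerate(rows):
        for k, (col, val) in enumerate(row):
            if k < ell_width:
                ell_j[i][k], ell_v[i][k] = col, val
            else:
                csr_j.append(col)
                csr_v.append(val)
        ap.append(len(csr_j))
    return ell_j, ell_v, ap, csr_j, csr_v

def hec_spmv(hec, x):
    """y = A x using both parts of the hybrid format."""
    ell_j, ell_v, ap, csr_j, csr_v = hec
    y = []
    for i in range(len(ell_j)):
        acc = sum(v * x[j] for j, v in zip(ell_j[i], ell_v[i]))  # regular ELL part
        for p in range(ap[i], ap[i + 1]):                        # irregular CSR part
            acc += csr_v[p] * x[csr_j[p]]
        y.append(acc)
    return y

rows = [[(0, 2.0), (1, 1.0)], [(1, 3.0)], [(0, 1.0), (1, 1.0), (2, 1.0)]]
y = hec_spmv(to_hec(rows, ell_width=2), [1.0, 2.0, 3.0])
print(y)  # [4.0, 6.0, 6.0]
```

The fixed-width ELL part gives every row the same amount of work, which suits one-thread-per-row GPU kernels, while the CSR part absorbs the few long rows without padding the whole matrix.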




GPU-based Reservoir Simulation

•  Conservation equations
   –  Material conservation
   –  Energy conservation

•  Solvers
   –  Linear solver, e.g. GMRES, BiCGSTAB, ORTHOMIN
   –  Nonlinear (Newton) solver


Jacobian Matrix Example

For three grid cells, each with unknowns $p$, $s$, $T$ and oil, water, and energy residuals $R_o$, $R_w$, $R_e$, the Newton system at iteration $n$ has block-tridiagonal form:

$$
\begin{pmatrix}
J_{11} & J_{12} & 0 \\
J_{21} & J_{22} & J_{23} \\
0 & J_{32} & J_{33}
\end{pmatrix}
\begin{pmatrix} \Delta x_1 \\ \Delta x_2 \\ \Delta x_3 \end{pmatrix}
= -
\begin{pmatrix} R_1 \\ R_2 \\ R_3 \end{pmatrix}^{n}
$$

where, for cells $i$ and $j$,

$$
J_{ij} =
\begin{pmatrix}
\partial R_{o,i}/\partial p_j & \partial R_{o,i}/\partial s_j & \partial R_{o,i}/\partial T_j \\
\partial R_{w,i}/\partial p_j & \partial R_{w,i}/\partial s_j & \partial R_{w,i}/\partial T_j \\
\partial R_{e,i}/\partial p_j & \partial R_{e,i}/\partial s_j & \partial R_{e,i}/\partial T_j
\end{pmatrix},
\quad
\Delta x_j = \begin{pmatrix} \Delta p_j \\ \Delta s_j \\ \Delta T_j \end{pmatrix},
\quad
R_i = \begin{pmatrix} R_{o,i} \\ R_{w,i} \\ R_{e,i} \end{pmatrix}.
$$

The two zero blocks are the couplings between cells 1 and 3, so the 9 × 9 Jacobian has 63 nonzero entries.
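The sparsity of this example can be checked with a short sketch. It assumes the example describes a one-dimensional chain of three cells with three unknowns each, coupled only to nearest neighbours; that is an interpretation of the slide rather than something it states, and `block_pattern` is a hypothetical helper.

```python
def block_pattern(n_cells, n_unknowns):
    """Boolean sparsity pattern for a 1D chain of cells coupled to nearest neighbours."""
    n = n_cells * n_unknowns
    pattern = [[False] * n for _ in range(n)]
    for ci in range(n_cells):
        for cj in range(n_cells):
            if abs(ci - cj) <= 1:                       # same cell or direct neighbour
                for a in range(n_unknowns):
                    for b in range(n_unknowns):
                        pattern[ci * n_unknowns + a][cj * n_unknowns + b] = True
    return pattern

pattern = block_pattern(n_cells=3, n_unknowns=3)
nnz = sum(sum(row) for row in pattern)
print(nnz, "of", len(pattern) ** 2)  # 63 of 81
```

Seven of the nine 3 × 3 blocks are populated, so 63 of the 81 entries can be nonzero. On a full three-dimensional grid each cell has more neighbours, but the number of nonzeros per row stays small while the matrix dimension grows into the millions.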


GPU-based Reservoir Simulation

[Flowchart: simulation loop]

1. Data input
2. Initialization
3. Start timestep loop
4. Start Newton iteration
5. Build Jacobian & r.h.s.
6. Solve matrix equation
7. Converged? No: return to step 4. Yes: continue.
8. Update and I/O
9. All timesteps done? No: return to step 3. Yes: End.

[Flowchart: simulation loop with the GPU matrix solver]

The loop is the same, except that step 6 becomes "Solve matrix equation on GPU" and the final check reads "Time to end?". The matrix solver consists of:

1. Matrix preprocess on CPU
2. Generate preconditioner M on CPU
3. Transfer data to GPU
4. Solve Ax = b on GPU
5. Transfer x back to CPU
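The control flow of the chart can be sketched as a timestep loop wrapped around a Newton iteration. This is a toy illustration, not the simulator: `solve_linear` is a Gaussian-elimination stand-in for the GPU Krylov solve, and the test problem (backward Euler for du/dt = -u^2) and all names are assumptions made for the example.

```python
def solve_linear(J, rhs):
    """Stand-in for the GPU solve of J dx = rhs: Gaussian elimination with pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(J, rhs)]     # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def simulate(residual, jacobian, x0, n_steps, dt, tol=1e-10, max_newton=20):
    x = x0[:]
    for _ in range(n_steps):                         # timestep loop
        x_old = x[:]
        for _ in range(max_newton):                  # Newton iteration
            R = residual(x, x_old, dt)               # build r.h.s.
            if max(abs(r) for r in R) < tol:         # converged?
                break
            J = jacobian(x, x_old, dt)               # build Jacobian
            dx = solve_linear(J, [-r for r in R])    # solve J dx = -R
            x = [xi + di for xi, di in zip(x, dx)]   # update
    return x

# Toy problem: backward Euler for du/dt = -u^2 with u(0) = 1, exact u(t) = 1 / (1 + t).
residual = lambda x, x_old, dt: [x[0] - x_old[0] + dt * x[0] ** 2]
jacobian = lambda x, x_old, dt: [[1.0 + 2.0 * dt * x[0]]]
u = simulate(residual, jacobian, [1.0], n_steps=100, dt=0.01)
```

In the GPU version of the flow, only the call corresponding to `solve_linear` moves to the device; Jacobian assembly and the Newton update stay on the CPU, which is why each solve is bracketed by data transfers.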


Numerical Experiments

•  CPU: Intel Xeon X5570, 8 MB cache, 2.93 GHz, 32 GB memory

•  GPU: NVIDIA Tesla C2050/C2070, 3 GB / 6 GB memory, ECC disabled

•  Environment: Linux (Fedora 13 x86_64, kernel 2.6.34.7-61), CUDA Toolkit 4.0, GCC 4.4.5

•  Compiler options: -arch=sm_20 -Xcompiler "-Wall" -O3


Numerical Experiments

Case 1: Testing 4 preconditioners and 3 solvers.

Case 2: Testing the effect of the number of blocks on the speedup of BILU(0) and BILUT.

Case 3: Testing the speedup of the whole simulation process.

Case description

Matrix     N           NNZ          NNZ/ROW
SPE10-1    2,188,851   29,915,573   13.7
SPE10-2    2,188,851   29,915,573   13.7


Case 1

Case description

Matrix     N           NNZ          NNZ/ROW
SPE10-1    2,188,851   29,915,573   13.7

Experimental parameters

Parameter                  Value
Relative tolerance         1E-3
Restart m                  40
Neumann polynomial order   16
METIS partitions           8
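The slides do not spell out the form of the Neumann polynomial preconditioner. One common form, assumed here, is the truncated Neumann series with diagonal scaling, $M^{-1} = \sum_{k=0}^{p} (I - D^{-1}A)^k D^{-1}$, where $D$ is the diagonal of $A$ and $p$ is the polynomial order. The sketch below applies it with Horner's rule on a small dense matrix; `neumann_apply` is a hypothetical name.

```python
def neumann_apply(A, r, order):
    """z = sum_{k=0}^{order} (I - D^{-1} A)^k D^{-1} r, evaluated by Horner's rule."""
    n = len(r)
    d_inv_r = [r[i] / A[i][i] for i in range(n)]
    z = d_inv_r[:]                                   # order-0 term: D^{-1} r
    for _ in range(order):
        Az = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]
        # z <- D^{-1} r + (I - D^{-1} A) z
        z = [d_inv_r[i] + z[i] - Az[i] / A[i][i] for i in range(n)]
    return z

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
r = [2.0, 4.0, 10.0]                                 # r = A * [1, 2, 3]
z = neumann_apply(A, r, order=16)                    # close to [1, 2, 3]
```

Each additional order costs one matrix-vector product and needs no triangular solve, so the whole preconditioner maps onto the same sparse kernel as the solver itself. The series only converges when the spectral radius of $I - D^{-1}A$ is below one, as it is for the diagonally dominant matrix in this example.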


Performance comparison

Solver   PC             Iterations   CPU time   GPU time   Speedup
GMRES    Neumann Poly   30           1620.5     125.9      12.9
         ILU(0)         18           263.8      27.9       9.5
         BILU(0)        20           307.8      27.5       11.2

All preconditioners give a speedup of about 10x. BILU(0) and ILU(0) both converge fast.


Performance comparison

Solver     PC             Iterations   CPU time   GPU time   Speedup
BiCGSTAB   Neumann Poly   359          740.7      64.7       11.4
           ILU(0)         260/249      84.3       11.7       7.2
           BILU(0)        243          85.6       9          9.5
ORTHOMIN   Neumann Poly   543          1449.9     114.1      12.7
           ILU(0)         392          284.8      30.1       9.5
           BILU(0)        400          283        27.6       10.3

The speedup is about 10x. BiCGSTAB with ILU(0) and BILU(0) solves faster than GMRES and ORTHOMIN.


Case 2

GMRES(20) + block ILU(0)

Blks   Iterations   CPU time   GPU time   Speedup
1      21           121.1      15         8.14
4      23           124.33     15         8.27
8      23           126.40     15.32      8.23
16     29           180.06     19.05      9.44

GMRES(20) + block ILUT

Blks   Iterations   CPU time   GPU time   Speedup
1      5            34.20      11.70      2.92
4      7            44.67      10.35      4.30
8      7            45.78      9.58       4.76
16     10           63.13      12.43      5.07


Case 3: GPU-based Reservoir Simulation

•  The SPE 10 Comparative Solution Project

•  Fine grid (60 × 220 × 85)

•  Highly heterogeneous

Parameter                  Value
Relative tolerance         1e-3
Restart m                  60
Neumann polynomial order   16
Number of blocks           8


Solver     PC             CPU time    GPU time   Speedup
GMRES      Neumann Poly   4h49m23s    29m43s     9.7
           ILU(0)         1h30m16s    17m18s     5.2
           BILU(0)        2h37m02s    20m18s     7.7
BiCGSTAB   Neumann Poly   4h14m57s    36m13s     7
           ILU(0)         1h0m40s     31m42s     1.9
           BILU(0)        1h7m22s     34m28s     2
ORTHOMIN   Neumann Poly   7h57m11s    56m27s     8.5
           ILU(0)         2h25m48s    37m58s     3.8
           BILU(0)        2h37m23s    41m22s     3.8
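The speedup column is the ratio of CPU to GPU wall-clock time. A small sketch of that arithmetic, using the GMRES + Neumann polynomial row as input (the helper names are assumptions for this example):

```python
import re

def to_seconds(stamp):
    """Parse a duration such as '4h49m23s' or '29m43s' into seconds."""
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", stamp)
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return 3600 * hours + 60 * minutes + seconds

def speedup(cpu, gpu):
    """Ratio of CPU time to GPU time."""
    return to_seconds(cpu) / to_seconds(gpu)

print(round(speedup("4h49m23s", "29m43s"), 1))  # 9.7
```

Here 4h49m23s is 17,363 s and 29m43s is 1,783 s, which gives 17363 / 1783 ≈ 9.7.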


Conclusions

•  Implemented a GPU-based linear solver package that includes BLAS operations, linear solvers, preconditioners, and several preprocessing methods.

•  Compared the speedup of different linear solvers and preconditioners, and achieved around 10x speedup for the SPE10 matrix.

•  Coupled our GPU-based linear solver package with an in-house black oil reservoir simulator to speed up the SPE10 simulation problem. Using GMRES, we achieve a speedup of 5-10x, depending on the preconditioner.

All publications can be accessed at:

•  http://sites.google.com/site/monramax/publication


THANK YOU
