
Innovative Technology for Reservoir Engineers Ridgeway Kite

Large Scale Reservoir Simulation utilizing multiple GPUs

Garf Bowen

25th March 2014


Summary

• Introduce

– RKS

– Reservoir Simulation

• HPC goals

• Implementation

• Large scale simulations

• Results & future

Ridgeway Kite

• Start-up (April 2013)

– Long history in Reservoir Simulation

– Sister company, NITEC – consulting

• Differentiators

– Massively Parallel Code

– Multiple Realizations

– “Unconventional”

– Coupled surface network


Reservoir Simulation

• Finite Volume

• Unstructured (features)

• Implicit

𝑹 = ∆𝑴 − 𝑭 = 𝟎
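As a rough illustration of the residual form above, the sketch below evaluates R = ∆M − F for a 1D chain of cells. The function name `residual`, the single transmissibility `trans`, the linear pressure-difference flux and the sign convention (F is net inflow over the step) are all assumptions made for illustration; none of them come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Residual R_i = (M_i_new - M_i_old) - F_i for a 1D chain of cells.
// F_i is the net inflow into cell i over the time step (assumed convention).
// Assumed flux from cell i to cell i+1: trans * (p[i] - p[i+1]) * dt.
std::vector<double> residual(const std::vector<double>& massOld,
                             const std::vector<double>& massNew,
                             const std::vector<double>& p,
                             double trans, double dt)
{
    const std::size_t n = p.size();
    assert(massOld.size() == n && massNew.size() == n);
    std::vector<double> r(n);
    for (std::size_t i = 0; i < n; ++i)
        r[i] = massNew[i] - massOld[i];      // accumulation term, delta M
    for (std::size_t i = 0; i + 1 < n; ++i) {
        const double flow = trans * (p[i] - p[i + 1]) * dt;
        r[i]     += flow;   // cell i loses mass: F_i = -flow, so R_i gains +flow
        r[i + 1] -= flow;   // cell i+1 gains mass: F_{i+1} = +flow
    }
    return r;
}
```

In an implicit scheme the pressures `p` are the unknown end-of-step values, so this function would be called repeatedly inside a nonlinear iteration until every entry is close to zero.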


Driving from London to Manchester…

Check the Ferrari or the traffic jam?

• Lot of code that all needs to go fast

• Challenge is often “not to go slow”

• Can’t just focus on “hot spots”


HPC goals

• “not to go slow”

• Portability CPU/GPU (+clusters)

– Want to be future proof

• Simplification

– (massive) parallelization is an opportunity

– Developer efficiency

– Same result on any platform


Shuffle Calculate Pattern

• Calculate: “one-to-one”

• Shuffle

– Scatter: I/O from node zero

– Gather: output

• All data is on the GPU

• Calculations are embarrassingly parallel

• No indirect addressing

• Ability to time separately


Example – calculate flows

• More flows than cells

• One cell involved in multiple flows

• One flow, two cells; different flow, same cell

• Multiple copies – “slots”
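One possible reading of the “slots” idea is sketched below: a shuffle step copies each cell value into a flow-indexed buffer that holds two slots per flow, after which the flow calculation is one-to-one over flows. The names `MapEntry`, `shuffleToFlows` and `calcFlows`, and the simple transmissibility-times-pressure-difference flow law, are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of a shuffle map: copy cell `src` into slot `slot` of flow `dest`.
struct MapEntry { int src; int dest; int slot; };

// Shuffle: gather cell pressures into a flow-indexed buffer, two slots per flow.
// All indirect addressing is confined to this step.
std::vector<double> shuffleToFlows(const std::vector<double>& cellP,
                                   const std::vector<MapEntry>& map,
                                   std::size_t nFlows)
{
    std::vector<double> slots(2 * nFlows, 0.0);
    for (const MapEntry& e : map)
        slots[2 * e.dest + e.slot] = cellP[e.src];
    return slots;
}

// Calculate: one-to-one over flows; flow j only reads its own two slots.
std::vector<double> calcFlows(const std::vector<double>& slots,
                              const std::vector<double>& trans)
{
    std::vector<double> flow(trans.size());
    for (std::size_t j = 0; j < trans.size(); ++j)
        flow[j] = trans[j] * (slots[2 * j] - slots[2 * j + 1]);
    return flow;
}
```

Because a cell that takes part in several flows is simply copied once per flow, the calculate step needs no index lookups, and the two steps can be timed separately.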


Simplicity Returns?

• “One code” kernel, many (independent) calls

• Split to run MPI distributed

• Underlying system - XPL

– Takes care of running

– Different modes

– Different architectures

• Code looks serial again


Maps & MPI

Src   Dest   Slot
i1    j1     0
i2    j2     1
i3    j3     0
i4    j4     1
…     …      …

Maps are defined in “serial” space Not recommended

test.exe –cpu

test.exe –gpu

mpirun -np 16 test.exe
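To make “maps are defined in serial space” concrete, here is a minimal sketch of how a serial-space map might be split per MPI rank. It assumes a simple contiguous block partition of both sources and destinations; `ownerOf`, `LocalMap` and `splitMap` are hypothetical names, and the slides do not say how XPL actually does this. No MPI calls are made; the sketch only classifies entries.

```cpp
#include <cassert>
#include <vector>

// Map row in serial (global) numbering, as in the Src/Dest/Slot table.
struct MapEntry { int src; int dest; int slot; };

// Block partition: which rank owns global index g when n items are split over nRanks.
int ownerOf(int g, int n, int nRanks)
{
    const int chunk = (n + nRanks - 1) / nRanks;   // ceiling division
    return g / chunk;
}

struct LocalMap {
    std::vector<MapEntry> local;    // source and destination both on this rank
    std::vector<MapEntry> remote;   // source owned elsewhere: would need a message
};

// Keep the entries whose destination this rank owns, and flag those
// whose source lives on another rank.
LocalMap splitMap(const std::vector<MapEntry>& serialMap,
                  int nSrc, int nDest, int nRanks, int rank)
{
    LocalMap out;
    for (const MapEntry& e : serialMap) {
        if (ownerOf(e.dest, nDest, nRanks) != rank)
            continue;                               // handled by another rank
        if (ownerOf(e.src, nSrc, nRanks) == rank)
            out.local.push_back(e);
        else
            out.remote.push_back(e);
    }
    return out;
}
```

The application code keeps writing maps in serial numbering; only a layer like this one needs to know how many ranks are in use.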


Simple Example

$$x_i = A_i^{-1} r_i \quad \forall i$$

• A: n*n small dense matrix

• ~millions of i’s

• LU factorization (partial pivoting)

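template<typename KP>
struct Testinv
{
    __host__ __device__
    Testinv(Args* inArgs, int index, int N)
    {
        // Bind the per-index views onto the argument list, in order
        int ia=0;
        mat<double,KP> a(inArgs,ia++,index);
        vec<double,KP> r(inArgs,ia++,index);
        vec<double,KP> x(inArgs,ia++,index);
        mat<double,KP> w(inArgs,ia++,index);

        w = a;          // work on a copy so a is left untouched
        w.inv();        // invert the small dense matrix in place
        x.zero();
        w.mult(r,x);    // x = w * r
    }
};

// Dispatch: the same kernel type is passed for both GPU and CPU
case rks::TestKernels::TEST_INV:
    calc(inArgs, gpu<Testinv<kp> >, cpu<Testinv<kp> >);
    break;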
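The sketch below shows one way a call such as `calc(inArgs, gpu<...>, cpu<...>)` could work on the CPU side: the kernel's constructor does the work for a single index, and a launcher loops over independent indices. The `Args` layout, the `Axpy` kernel, the `Launcher` alias and this simplified `calc` signature are assumptions for illustration only; the real XPL types are not shown in the slides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical argument bundle; the real Args type is not shown in the slides.
struct Args {
    std::vector<double>* x;
    std::vector<double>* y;
    double alpha;
};

// A kernel in the style of Testinv: the constructor handles one index.
struct Axpy {
    Axpy(Args* a, int index) { (*a->y)[index] += a->alpha * (*a->x)[index]; }
};

// CPU launcher: a plain loop over independent indices.
template <typename Kernel>
void cpu(Args* a, int n)
{
    for (int i = 0; i < n; ++i) {
        Kernel k(a, i);     // constructing the kernel runs it for index i
        (void)k;
    }
}

// Dispatcher: takes whichever launcher the caller selected. A GPU launcher
// would have the same signature and wrap the same kernel type.
using Launcher = void (*)(Args*, int);
void calc(Args* a, int n, Launcher launch) { launch(a, n); }
```

Since each index is handled independently, the loop body can be mapped onto threads without changing the kernel source, which is the sense in which the code “looks serial again”.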

[Figure: Scaling. Fitted trend lines: y = 2.35x + 2.31 and y = 2.23x + 1.20]


Now add complexity

well      --                   40    8.4
jac       --                   40   19.1
mass      -- --                40    1.9
flow      -- --                40   16.5
flow_     -- -- --           4640   16.0
norm      --                   40    0.4
lin       --                   30   52.7   52.5
ling      -- --                30    2.0    2.0
lins      -- --                30   50.0
orth-it   -- -- --             30   49.9
norm      -- -- -- --         219    0.1
precon    -- -- -- --         189   48.1
pressure  -- -- -- -- --      189   46.9

====================================================
Comparison between:
cpu 1243.630 and gpu 147.960
====================================================

well      --                  1.0    0.08
jac       --                  1.0   12.62
mass      -- --               1.0   17.93
flow      -- --               1.0   11.66
flow_     -- -- --            1.0   11.84
norm      --                  1.0    2.19
lin       --                  1.0    9.87
ling      -- --               1.0    1.70
lins      -- --               1.0   10.08
orth-it   -- -- --            1.0   10.10
norm      -- -- -- --         1.0   48.40
precon    -- -- -- --         1.0    9.17
pressure  -- -- -- -- --      1.0    8.24
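Assuming the two totals in the comparison are run times in the same units, the overall ratio works out as:

$$\text{speedup} = \frac{t_{\text{cpu}}}{t_{\text{gpu}}} = \frac{1243.630}{147.960} \approx 8.4$$

This overall figure sits below most of the per-section ratios listed above, which fits the earlier point that every part of the code has to avoid going slow.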


Linear Solver Strategy

• Linear solver: important

• Communication mechanism: a challenge in parallel environments

• …but we’re only a small company, and don’t really want to be linear solver experts

• Like getting “the same” results: if we can implement a solver in XPL, then we get this for free

• Home grown: may not be competitive

• Using Nvidia’s AmgX: performing, but lose the “same” algorithm


Linear Solver

• Home Grown

– Massively helpful for development

• Same results for all configurations

– Challenged algorithmically on difficult problems

• AmgX

– Many options (pre-coded)

– Single GPU working well

– Focussed our effort here

• MPI programming becomes important


Strategy as problem size increases

• Tesla C2070

– 6 GB memory

– Black Oil model, 1 million cells (SPE10: 1.2e6 cells)

• Little incentive to utilize >1 GPU

• noting people will often run multiple realizations

• Larger model -> cluster

– Memory constrained


Scaling Test

• Based on SPE10 benchmark

– Refined model

– 5 wells

– ~1 million cells

• We can fit:

– Base case on one GPU

– 4 (connected) copies on 4 GPUs

• Actually require 8 GPUs – extra memory

– 16 copies on 16/32 GPUs

• Less challenging scaling than refinement


Memory & Performance

[Figure: Memory by processor (1 to 8) for three cases: 1E6 - 1 GPU, 1E6 - 2 GPUs, 4E6 - 8 GPUs]

[Figure: Example Performance for the cases "1E6-1GPU", "1E6-2GPU", "4E6-8GPU"]

Lessons:

• Very variable timings

• Instrumentation vital

Future:

• Still working on the 32-way case

• Classical MPI optimization step


Summary & Conclusions

• Shuffle-Calculate pattern

– Works for us, so far

– Portable

– Allowing us to exploit the GPU

– Using AmgX we’re able to tackle realistic cases requiring multiple GPUs

• Full system

– Commercial offering early next year


Acknowledgements

• Co-authors: Bachar Zineddin & Tommy Miller

• Jeremy Appleyard, Nvidia

• “The authors would like to acknowledge the work presented here made use of the IRIDIS*/EMERALD* HPC facility provided by the Centre for Innovation.”

• Nvidia for AmgX beta access


Questions?


Backup#1 – LU code example

//
// Main elimination loop
//
for (int j=0; j<m_xdim; j++)
{
    // Sum: entries above the diagonal in column j
    for (int i=0; i<j; i++)
    {
        double sum = (*this)(i,j);
        for (int k=0; k<i; k++)
            sum = sum - (*this)(i,k)*(*this)(k,j);
        (*this)(i,j) = sum;
    }

    // Max: remaining entries of column j, tracking the largest scaled pivot
    aamax = 0.0;
    for (int i=j; i<m_xdim; i++)
    {
        double sum = (*this)(i,j);
        for (int k=0; k<j; k++)
            sum = sum - (*this)(i,k)*(*this)(k,j);
        (*this)(i,j) = sum;
        if ( std::fabs(vv[i]*sum)>=aamax )
        {
            imax = i;
            aamax = std::fabs(vv[i]*sum);
        }
    }

    // Swap rows j and imax when the pivot is not already on the diagonal
    if (j!=imax)
    {
        for (int k=0; k<m_xdim; k++)
        {
            double dum = (*this)(imax,k);
            (*this)(imax,k) = (*this)(j,k);
            (*this)(j,k) = dum;
        }
        vv[imax] = vv[j];
    }

    // Store the pivot row; replace an exactly zero pivot by a tiny value
    piv[j] = imax;
    if ( (*this)(j,j)==0.0 )
        (*this)(j,j) = 1e-20;

    // Set: divide the sub-diagonal part of column j by the pivot
    if (j!=m_xdim-1)
    {
        double dum = 1.0/(*this)(j,j);
        for (int i=j+1; i<m_xdim; i++)
            (*this)(i,j) = (*this)(i,j)*dum;
    }
}
//------ End lu step ----
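For readers who want something that compiles on its own, here is a self-contained sketch of solving A x = r for one small dense system using LU with partial pivoting. It is not the class from the slide: `luSolve` is a hypothetical free function, the matrix is stored row-major in a `std::vector`, and the pivot search is unscaled (no `vv` scaling factors).

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Solve A x = r for a small dense n*n matrix (row-major) by LU with partial pivoting.
// a and r are taken by value and overwritten; the solution is returned.
std::vector<double> luSolve(std::vector<double> a, std::vector<double> r, int n)
{
    assert(static_cast<int>(a.size()) == n * n && static_cast<int>(r.size()) == n);
    for (int j = 0; j < n; ++j) {
        // Partial pivoting: largest magnitude entry in column j, rows j..n-1.
        int imax = j;
        for (int i = j + 1; i < n; ++i)
            if (std::fabs(a[i * n + j]) > std::fabs(a[imax * n + j]))
                imax = i;
        if (imax != j) {
            for (int k = 0; k < n; ++k)
                std::swap(a[j * n + k], a[imax * n + k]);
            std::swap(r[j], r[imax]);
        }
        if (a[j * n + j] == 0.0)
            a[j * n + j] = 1e-20;     // same tiny-pivot guard as on the slide
        // Eliminate below the pivot, applying the same row operations to r.
        for (int i = j + 1; i < n; ++i) {
            const double l = a[i * n + j] / a[j * n + j];
            a[i * n + j] = l;         // keep the multiplier (the L factor)
            for (int k = j + 1; k < n; ++k)
                a[i * n + k] -= l * a[j * n + k];
            r[i] -= l * r[j];
        }
    }
    // Back substitution with the upper triangle.
    for (int i = n - 1; i >= 0; --i) {
        double sum = r[i];
        for (int k = i + 1; k < n; ++k)
            sum -= a[i * n + k] * r[k];
        r[i] = sum / a[i * n + i];
    }
    return r;
}
```

In the setting of the talk, a routine like this would run once per index i over millions of independent small systems, which is why it suits a one-thread-per-system mapping.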


Backup#2 – Home Grown Solver

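$$\begin{pmatrix} A_{ww} & A_{wb} \\ A_{bw} & A_{bb} \end{pmatrix} \begin{pmatrix} x_w \\ x_b \end{pmatrix} = \begin{pmatrix} R_w \\ R_b \end{pmatrix}$$

$$\begin{pmatrix} A_{ww} & 0 \\ A_{bw} & A_{bb}^* \end{pmatrix} \begin{pmatrix} I & A_{wb}^* \\ 0 & I \end{pmatrix} \begin{pmatrix} x_w \\ x_b \end{pmatrix} = \begin{pmatrix} R_w \\ R_b \end{pmatrix}$$

$$A_{bb}^* = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb}$$

$$(1 - x)^{-1} = 1 + x + x^2 + x^3 + \dots$$

$$x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1}$$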
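The series on this slide suggests applying an inverse through repeated multiplication rather than a direct solve. The sketch below applies a truncated version of that series to a vector for a generic small dense matrix X. The names `matvec` and `neumannApply` and the fixed term count are assumptions for illustration, and the series only converges when the spectral radius of X is below one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Dense row-major matrix-vector product y = X v.
Vec matvec(const Vec& X, const Vec& v)
{
    const std::size_t n = v.size();
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            y[i] += X[i * n + k] * v[k];
    return y;
}

// Approximate (I - X)^{-1} r by the truncated series
// r + X r + X^2 r + ... + X^terms r.
Vec neumannApply(const Vec& X, const Vec& r, int terms)
{
    Vec sum = r;
    Vec power = r;                       // holds X^m r
    for (int m = 0; m < terms; ++m) {
        power = matvec(X, power);
        for (std::size_t i = 0; i < sum.size(); ++i)
            sum[i] += power[i];
    }
    return sum;
}
```

Only matrix-vector products are needed, so each term keeps the embarrassingly parallel structure of the other kernels; the cost is that slowly converging cases need many terms.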
