Post on 07-Mar-2021
Innovative Technology for Reservoir Engineers Ridgeway Kite
Large Scale Reservoir Simulation utilizing multiple GPUs
Garf Bowen
25th March 2014
Summary
• Introduce
– RKS
– Reservoir Simulation
• HPC goals
• Implementation
• Large scale simulations
• Results & future
RKS
• Start-up (April 2013)
– Long history in Reservoir Simulation
– Sister company, NITEC – consulting
• Differentiators
– Massively Parallel Code
– Multiple Realizations
– “Unconventional”
– Coupled surface network
Reservoir Simulation
• Finite Volume
• Unstructured (features)
• Implicit
\[ R = \Delta M - F = 0 \]
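As a rough illustration of what the residual R = ΔM − F means cell by cell, the sketch below evaluates it on a 1D chain of cells. Nothing here is taken from the RKS code: the function name `residual`, the linear mass model (mass = storage × pressure) and the two-point flow dt × trans × (pressure difference) are all assumptions chosen to keep the example small. The flows are evaluated at the new pressures, which is what makes the formulation implicit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D model: mass in cell i is storage[i] * p[i], and the flow
// into cell c from cell c+1 over a step dt is dt * trans[c] * (p[c+1] - p[c]).
// Returns R_i = (M_i(p_new) - M_i(p_old)) - F_i, where F_i is the net inflow.
std::vector<double> residual(const std::vector<double>& p_new,
                             const std::vector<double>& p_old,
                             const std::vector<double>& storage,
                             const std::vector<double>& trans,
                             double dt)
{
    const std::size_t n = p_new.size();
    assert(p_old.size() == n && storage.size() == n);
    assert(n == 0 || trans.size() == n - 1);

    std::vector<double> r(n);
    // Accumulation term: change in mass over the step.
    for (std::size_t i = 0; i < n; ++i)
        r[i] = storage[i] * (p_new[i] - p_old[i]);

    // Flow term: each connection moves mass from one cell to its neighbour,
    // so it enters the two residuals with opposite signs.
    for (std::size_t c = 0; c + 1 < n; ++c) {
        const double flow = dt * trans[c] * (p_new[c + 1] - p_new[c]); // into cell c
        r[c]     -= flow;
        r[c + 1] += flow;
    }
    return r;
}
```

Because every flow is added to one cell and subtracted from its neighbour, the flow contributions to the residuals sum to zero, which is the finite-volume conservation property.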
Driving from London to Manchester…
Check the Ferrari or the traffic jam?
Lot of code that all needs to go fast
Challenge is often “not to go slow”
Can’t just focus on “hot spots”
HPC goals
• “not to go slow”
• Portability CPU/GPU (+clusters)
– Want to be future proof
• Simplification
– (massive) parallelization is an opportunity
– Developer efficiency
– Same result on any platform
Shuffle Calculate Pattern
Calculate “one-to-one”
Shuffle
Scatter I/O from node zero
Gather output
• All data is on the GPU
• Calculations are embarrassingly parallel
• No indirect addressing
• Ability to time separately
Example – calculate flows
More flows than cells
One cell involved in Multiple flows
One flow two cells
Different flow same cell
Multiple copies – “slots”
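One way to picture the shuffle/calculate split for the flow example is sketched below. It is not the XPL implementation: `MapEntry`, `shuffle`, `calculate_flows` and the simple flow formula trans × (value in slot 0 − value in slot 1) are hypothetical names and choices made for illustration. The shuffle step does all the indirect addressing, copying each cell value into a per-flow "slot"; the calculate step then works strictly one-to-one on flow-ordered arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of a hypothetical map: copy cell `src` into slot `slot` of flow `dest`.
struct MapEntry { std::size_t src; std::size_t dest; int slot; };

constexpr int kSlots = 2; // each flow sees two cells

// Shuffle: all indirect addressing happens here. The result holds kSlots
// copies of cell data laid out per flow: slots[dest * kSlots + slot].
std::vector<double> shuffle(const std::vector<double>& cell_values,
                            const std::vector<MapEntry>& map,
                            std::size_t n_flows)
{
    std::vector<double> slots(n_flows * kSlots, 0.0);
    for (const MapEntry& e : map) {
        assert(e.src < cell_values.size() && e.dest < n_flows);
        assert(e.slot >= 0 && e.slot < kSlots);
        slots[e.dest * kSlots + e.slot] = cell_values[e.src];
    }
    return slots;
}

// Calculate: strictly one-to-one. Flow f reads only its own slots and its own
// transmissibility, so every iteration is independent of the others.
std::vector<double> calculate_flows(const std::vector<double>& slots,
                                    const std::vector<double>& trans)
{
    assert(slots.size() == trans.size() * kSlots);
    std::vector<double> flow(trans.size());
    for (std::size_t f = 0; f < trans.size(); ++f)
        flow[f] = trans[f] * (slots[f * kSlots + 0] - slots[f * kSlots + 1]);
    return flow;
}
```

With three cells and two flows, the middle cell appears in the map twice, once in slot 1 of the first flow and once in slot 0 of the second, which is the "one cell involved in multiple flows" case from the slide.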
Simplicity Returns?
“one code” kernel
many (independent) calls
Split to run MPI distributed
Underlying system - XPL
• Takes care of running
• Different modes
• Different architectures
Code looks serial again
Maps & MPI
Src Dest Slot
i1 j1 0
i2 j2 1
i3 j3 0
i4 j4 1
… … …
Maps are defined in “serial” space
Not recommended
test.exe -cpu
test.exe -gpu
mpirun -np 16 test.exe
Simple Example
\[ x_i = A_i^{-1} r_i \quad \forall i \]
A: n*n small dense matrix
~millions of i’s
LU factorization (partial pivoting)
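template<typename KP>
struct Testinv
{
    // Constructor body is the kernel: it runs once per index.
    __host__ __device__
    Testinv(Args* inArgs, int index, int N)
    {
        int ia=0;
        mat<double,KP> a(inArgs,ia++,index);   // input matrix A_i
        vec<double,KP> r(inArgs,ia++,index);   // right-hand side r_i
        vec<double,KP> x(inArgs,ia++,index);   // result x_i
        mat<double,KP> w(inArgs,ia++,index);   // work matrix
        w = a;          // keep a intact, invert the copy
        w.inv();
        x.zero();
        w.mult(r,x);    // presumably x = w * r
    }
};

// Fragment from the calling code: the same kernel is dispatched for GPU or CPU.
case rks::TestKernels::TEST_INV:
    calc(inArgs, gpu<Testinv<kp> >, cpu<Testinv<kp> >);
    break;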
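The kernel above depends on XPL types that are not shown in the slides, so the following is a plain C++ stand-in for the same per-index task, x_i = A_i^{-1} r_i, rather than a reconstruction of it. It swaps in Gaussian elimination with partial pivoting and back substitution instead of forming the inverse, and the names `solve_dense` and `solve_all` are hypothetical. The point it tries to show is structural: each index owns its own small system, so the outer loop has no dependencies between iterations.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solve one small dense system A x = r by Gaussian elimination with partial
// pivoting. `a` is n*n, row-major, and is taken by value because it is
// overwritten during elimination (like the work matrix `w` on the slide).
std::vector<double> solve_dense(std::vector<double> a, std::vector<double> r, std::size_t n)
{
    assert(a.size() == n * n && r.size() == n);
    for (std::size_t j = 0; j < n; ++j) {
        // Pivot: the row at or below the diagonal with the largest |entry| in column j.
        std::size_t p = j;
        for (std::size_t i = j + 1; i < n; ++i)
            if (std::fabs(a[i * n + j]) > std::fabs(a[p * n + j])) p = i;
        if (p != j) {
            for (std::size_t k = 0; k < n; ++k) std::swap(a[j * n + k], a[p * n + k]);
            std::swap(r[j], r[p]);
        }
        assert(a[j * n + j] != 0.0); // this sketch does not handle singular A
        // Eliminate column j from the rows below.
        for (std::size_t i = j + 1; i < n; ++i) {
            const double m = a[i * n + j] / a[j * n + j];
            for (std::size_t k = j; k < n; ++k) a[i * n + k] -= m * a[j * n + k];
            r[i] -= m * r[j];
        }
    }
    // Back substitution on the upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t j = n; j-- > 0;) {
        double sum = r[j];
        for (std::size_t k = j + 1; k < n; ++k) sum -= a[j * n + k] * x[k];
        x[j] = sum / a[j * n + j];
    }
    return x;
}

// Apply the solver independently at every index i: x_i = A_i^{-1} r_i.
// No iteration reads another iteration's data.
std::vector<std::vector<double>> solve_all(const std::vector<std::vector<double>>& A,
                                           const std::vector<std::vector<double>>& R,
                                           std::size_t n)
{
    assert(A.size() == R.size());
    std::vector<std::vector<double>> X(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) X[i] = solve_dense(A[i], R[i], n);
    return X;
}
```

For example, the 2×2 system with rows (0, 2) and (1, 1) and right-hand side (4, 3) needs a row swap before elimination and yields x = (1, 2).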
[Figure: Scaling. Log time (secs) against log n for the CPU and GPU runs, with fitted lines y = 2.35x + 2.31 and y = 2.23x + 1.20.]
Now add complexity
well -- 40 8.4
jac -- 40 19.1
mass -- -- 40 1.9
flow -- -- 40 16.5
flow_ -- -- -- 4640 16.0
norm -- 40 0.4
lin -- 30 52.7 52.5
ling -- -- 30 2.0 2.0
lins -- -- 30 50.0
orth-it -- -- -- 30 49.9
norm -- -- -- -- 219 0.1
precon -- -- -- -- 189 48.1
pressure -- -- -- -- -- 189 46.9
====================================================
Comparison between:
cpu 1243.630 and gpu 147.960
====================================================
well -- 1.0 0.08
jac -- 1.0 12.62
mass -- -- 1.0 17.93
flow -- -- 1.0 11.66
flow_ -- -- -- 1.0 11.84
norm -- 1.0 2.19
lin -- 1.0 9.87
ling -- -- 1.0 1.70
lins -- -- 1.0 10.08
orth-it -- -- -- 1.0 10.10
norm -- -- -- -- 1.0 48.40
precon -- -- -- -- 1.0 9.17
pressure -- -- -- -- -- 1.0 8.24
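The comparison does not print an overall ratio, but it follows from the two totals, assuming both are reported in the same unit:

```latex
\[
\frac{t_{\mathrm{cpu}}}{t_{\mathrm{gpu}}} = \frac{1243.630}{147.960} \approx 8.4
\]
```

If the last column of the listing is read as the CPU-to-GPU time ratio per section (an assumption, since the columns are unlabelled), this overall figure sits close to the ratios shown for the dominant sections, such as 8.24 for pressure and 9.17 for precon, while well, at 0.08, is the one section that runs slower on the GPU.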
Linear Solver Strategy
• Linear Solver: important
• Communication Mechanism: challenge in parallel environments
• …but we’re only a small company, and don’t really want to be linear solver experts
• Like getting “the same” results: if we can implement a solver in XPL, then we get this for free
• Home grown: may not be competitive
• Using Nvidia’s AmgX: performing, but lose the “same” algorithm
Linear Solver
• Home Grown
– Massively helpful for development
• Same results for all configurations
– Challenged algorithmically on difficult problems
• AmgX
– Many options (pre-coded)
– Single GPU working well
– Focussed our effort here
• MPI programming becomes important
Strategy as problem size increases
• Tesla C2070
– 6 GB memory
– Black Oil model, 1 million cells (SPE10: 1.2e6 cells)
• Little incentive to utilize >1 GPU
• noting people will often run multiple realizations
• Larger model -> cluster
– Memory constrained
Scaling Test
• Based on SPE10 benchmark
– Refined model
– 5 wells
– ~1 million cells
• We can fit:
– Base case on one GPU
– 4 (connected) copies on 4 GPUs
• Actually require 8 GPUs
– Extra memory
– 16 copies on 16/32 GPUs
• Less challenging scaling than refinement
Memory & Performance
[Figure: Memory. Memory (Mb) against processors 1 to 8, for three cases: 4E6 - 8 GPUs, 1E6 - 2 GPUs, 1E6 - 1 GPU.]
[Figure: Example Performance. Wall clock time (secs) for the cases 1E6-1GPU, 1E6-2GPU and 4E6-8GPU.]
Lessons:
• Very variable timings
• Instrumentation vital
Future:
• Still working on the 32-way case
• Classical MPI optimization step
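The slides do not show how the timings were collected. A minimal sketch of one common approach, a scoped timer that accumulates a call count and wall-clock seconds per named section, is given below; `Profiler`, `Scope` and `SectionStats` are hypothetical names, and the code uses only the standard `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulated call count and wall-clock seconds for one named section.
struct SectionStats { long calls = 0; double seconds = 0.0; };

// Hypothetical registry keyed by section name (e.g. "flow", "lin").
class Profiler {
public:
    // RAII guard: starts timing on construction, records on destruction.
    class Scope {
    public:
        Scope(Profiler& p, std::string name)
            : prof_(p), name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
        ~Scope() {
            const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start_;
            SectionStats& s = prof_.stats_[name_];
            s.calls += 1;
            s.seconds += dt.count();
        }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;
    private:
        Profiler& prof_;
        std::string name_;
        std::chrono::steady_clock::time_point start_;
    };

    // Look up the totals for a section that has been timed at least once.
    const SectionStats& get(const std::string& name) const {
        auto it = stats_.find(name);
        assert(it != stats_.end());
        return it->second;
    }
private:
    std::map<std::string, SectionStats> stats_;
};
```

A section is timed by declaring a guard at the top of a block, for example `Profiler::Scope s(prof, "flow");`. On a GPU, kernel launches are typically asynchronous, so a host-side timer like this one would only be meaningful if the device is synchronized before the guard goes out of scope; that step is not shown here.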
Summary & Conclusions
• Shuffle-Calculate pattern
– Works for us, so far
– Portable
– Allowing us to exploit the GPU
– Using AmgX we’re able to tackle realistic cases requiring multiple GPUs
• Full system
– Commercial offering early next year
Acknowledgements
• Co-authors: Bachar Zineddin & Tommy Miller
• Jeremy Appleyard, Nvidia
• “The authors would like to acknowledge the work presented here made use of the IRIDIS*/EMERALD* HPC facility provided by the Centre for Innovation.”
• Nvidia for AmgX beta access
Questions?
Backup#1 – LU code example
//
// Main elimination loop (in-place LU with implicit row scaling vv[] and
// pivot record piv[]; aamax, imax, vv and piv are declared outside this fragment)
//
for (int j=0; j<m_xdim; j++)
{
    //
    // Sum: entries of U above the diagonal in column j
    //
    for (int i=0; i<j; i++)
    {
        double sum = (*this)(i,j);
        for (int k=0; k<i; k++)
        {
            sum = sum - (*this)(i,k)*(*this)(k,j);
        }
        (*this)(i,j) = sum;
    }
    //
    // Max: remaining entries of column j, and search for the pivot row
    //
    aamax = 0.0;
    for (int i=j; i<m_xdim; i++)
    {
        double sum = (*this)(i,j);
        for (int k=0; k<j; k++)
        {
            sum = sum - (*this)(i,k)*(*this)(k,j);
        }
        (*this)(i,j) = sum;
        if ( std::fabs(vv[i]*sum)>=aamax )
        {
            imax = i;
            aamax = std::fabs(vv[i]*sum);
        }
    }
    //
    // Swap rows j and imax
    //
    if (j!=imax)
    {
        for (int k=0; k<m_xdim; k++)
        {
            // Read column k, not column j, so the whole row is exchanged
            double dum = (*this)(imax,k);
            (*this)(imax,k) = (*this)(j,k);
            (*this)(j,k) = dum;
        }
        vv[imax] = vv[j];
    }
    //
    // Store the pivot row; guard against an exactly zero pivot
    //
    piv[j] = imax;
    if ( (*this)(j,j)==0.0 )
    {
        (*this)(j,j) = 1e-20;
    }
    //
    // Set: divide the sub-diagonal part of column j by the pivot
    //
    if (j!=m_xdim-1)
    {
        double dum = 1.0/(*this)(j,j);
        for (int i=j+1; i<m_xdim; i++)
        {
            (*this)(i,j) = (*this)(i,j)*dum;
        }
    }
}
//------ End lu step ----
Backup#2 – Home Grown Solver
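\[
\begin{pmatrix} A_{ww} & A_{wb} \\ A_{bw} & A_{bb} \end{pmatrix}
\begin{pmatrix} x_w \\ x_b \end{pmatrix}
=
\begin{pmatrix} R_w \\ R_b \end{pmatrix}
\]
\[
\begin{pmatrix} A_{ww} & 0 \\ A_{bw} & A_{bb}^{*} \end{pmatrix}
\begin{pmatrix} I & A_{wb}^{*} \\ 0 & I \end{pmatrix}
\begin{pmatrix} x_w \\ x_b \end{pmatrix}
=
\begin{pmatrix} R_w \\ R_b \end{pmatrix}
\]
\[ A_{bb}^{*} = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb} \]
Note:
\[ (1 - x)^{-1} = 1 + x + x^{2} + x^{3} + \dots \]
With:
\[ x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1} \]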
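Read together, the fragments on this slide appear to describe a block elimination of the w unknowns followed by a series approximation to the inverse of the resulting Schur complement A*_bb = A_bb − A_bw A_ww^{-1} A_wb. The connecting algebra is sketched below under that reading, with the scalar 1 of the series replaced by the identity matrix I, with x = A_bw A_ww^{-1} A_wb A_bb^{-1} as on the slide, and assuming the series converges (spectral radius of x below one):

```latex
\begin{align*}
A_{bb}^{*} &= A_{bb} - A_{bw} A_{ww}^{-1} A_{wb}
            = \left(I - A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1}\right) A_{bb}
            = (I - x)\, A_{bb}, \\
\left(A_{bb}^{*}\right)^{-1} &= A_{bb}^{-1} (I - x)^{-1}
            = A_{bb}^{-1} \left(I + x + x^{2} + x^{3} + \dots\right).
\end{align*}
```

Truncating the series after a few terms would give an approximate inverse built only from products with A_bb^{-1}, A_ww^{-1} and the coupling blocks. How many terms the home-grown solver keeps is not stated in the slides.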