Scaling SINBAD software to 3-D on Yemoja

Released to public domain under Creative Commons license type BY (https://creativecommons.org/licenses/by/4.0). Copyright (c) 2018 SINBAD consortium - SLIM group @ The University of British Columbia.

University of British Columbia SLIM
Curt Da Silva, Haneet Wason, Mathias Louboutin, Bas Peters, Shashin Sharan, Zhilong Fang
Scaling SINBAD software to 3-D on Yemoja
This talk
Showcase SLIM software as it applies to large(r)-scale problems on the Yemoja cluster
Performance scaling
• as the number of parallel resources increases
• comparisons to existing codes in C
Large data examples
A u = q
A : time-domain forward modelling matrix
u : vectorized wavefield of all time steps and modelling grid points
q : source
Continuous form
q_k : source wavefield at time step k
FWI Gradient
The FWI gradients have to pass the adjoint test:
• we only compute actions of J and Jᵀ, never the matrices themselves
• to ensure they are true adjoints, the migration/demigration operators need to satisfy the dot-product test ⟨J x, y⟩ = ⟨x, Jᵀ y⟩ (see the sketch below)
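As an aside, a minimal MATLAB sketch of this dot-product test; a small random matrix stands in for the matrix-free Jacobian, and the names Jfwd/Jadj are illustrative, not the SLIM operator interface:

% Dot-product (adjoint) test: <J*x, y> should equal <x, J'*y> up to
% roundoff. A small explicit matrix stands in for the Jacobian; in the
% actual code J and J' are applied matrix-free.
rng(1);
n = 50;  m = 80;
J    = randn(m, n);            % placeholder Jacobian
Jfwd = @(x) J  * x;            % demigration (forward) action
Jadj = @(y) J' * y;            % migration (adjoint) action

x = randn(n, 1);
y = randn(m, 1);

lhs = y' * Jfwd(x);            % <J x, y>
rhs = Jadj(y)' * x;            % <x, J' y>
fprintf('relative adjoint error: %e\n', abs(lhs - rhs) / abs(lhs));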
Gradient Test
[Figure: log-log plot of the Taylor errors versus step size h]
Zeroth-order Taylor error, O(h): ||F(m0 + h·δm, q0 + h·δq) − F(m0, q0)||
First-order Taylor error, O(h²): ||F(m0 + h·δm, q0 + h·δq) − F(m0, q0) − h·Jm δm − h·Jq δq||
Ensure 2nd-order convergence of the Taylor expansion
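A sketch of the same convergence check on a smooth toy function standing in for the modelling operator F (only the model is perturbed here; the real test also perturbs the source term):

% Gradient (Taylor) test: the zeroth-order error should decay as O(h),
% the first-order error as O(h^2). A quadratic toy function F and its
% exact gradient g stand in for the forward operator and Jacobian.
rng(2);
n  = 20;
A  = randn(n);  A = A' * A;           % toy "physics"
F  = @(m) 0.5 * m' * A * m;           % misfit-like scalar function
g  = @(m) A * m;                      % its exact gradient
m0 = randn(n, 1);  dm = randn(n, 1);

for h = 10.^(-1:-1:-6)
    e0 = abs(F(m0 + h*dm) - F(m0));                      % O(h)
    e1 = abs(F(m0 + h*dm) - F(m0) - h * g(m0)' * dm);    % O(h^2)
    fprintf('h = %1.0e   zeroth: %.3e   first: %.3e\n', h, e0, e1);
end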
[Figure: Good initial model — X location (m) vs. depth (m)]
[Figure: FWI result — X location (m) vs. depth (m)]
Chevron modeling code
• Stencil-based (no matrix)
Matlab basic
• Matrix-based
• Contains:
  - forward time stepping and adjoint time stepping (true adjoint)
  - Jacobian and its adjoint (true adjoint)
  - necessary for FWI, LSRTM, ...
Matlab basic
Time  step
A1_inv : 1.891 GB
A2 : 16.968 GB
A3 : 1.891 GB
Ps : 1.703 KB
U1 : 645.481 MB
U2 : 645.481 MB
U3 : 645.481 MB
adjoint_mode : 1.000 B
mode : 8.000 B
nt : 8.000 B
op : 645.494 MB
x : 38.586 KB
y : 645.481 MB
==========================
T : 23.902 GB
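As an illustration of the matrix-based scheme profiled above, one time step might look like the sketch below; A1_inv, A2, A3 and the source-injection operator Ps are the large sparse matrices in the report, replaced here by tiny placeholders so the snippet runs:

% One leapfrog-style time step of the matrix-based scheme:
%   U3 = A1_inv * (A2 * U2 + A3 * U1 + Ps' * q)
% The real A1_inv, A2, A3 are the large sparse matrices listed above;
% small stand-ins are used here so the example is self-contained.
n      = 1000;                              % tiny "grid" for illustration
A1_inv = speye(n);                          % placeholder sparse matrices
A2     = 2 * speye(n);
A3     = -speye(n);
Ps     = sparse(1, round(n/2), 1, 1, n);    % point-source injection row
q      = 1.0;                               % source amplitude at this step

U1 = zeros(n, 1);                           % wavefield at t - dt
U2 = zeros(n, 1);                           % wavefield at t
U3 = A1_inv * (A2 * U2 + A3 * U1 + Ps' * q);   % wavefield at t + dt

U1 = U2;  U2 = U3;                          % shift time levels for next step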
Double precision => Single precision
Matlab advanced
Single precision + stencil-based:
• 20 times less memory than sparse matrices
Single precision:
• wavefields are two times less expensive memory-wise
C MatVec (see the sketch below):
• multi-threaded over RHS
• no matrix stored at all (instead of ~20 GB of sparse matrices)
• communication overhead between Matlab and C
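A MATLAB sketch of the stencil idea: a single-precision 3-D Laplacian applied to all right-hand sides at once, without forming any matrix. This only illustrates the concept; the actual C MatVec runs the equivalent loop in compiled, multi-threaded code:

% Matrix-free 7-point Laplacian stencil applied to nsrc wavefields,
% stored in single precision; no sparse matrix is ever built.
n    = [100 100 100];                 % small grid for illustration
nsrc = 20;
U    = randn([n nsrc], 'single');     % one wavefield per source (RHS)
L    = zeros(size(U), 'single');

for s = 1:nsrc                        % the C MEX version threads this loop
    u = U(:, :, :, s);
    L(2:end-1, 2:end-1, 2:end-1, s) = ...
          u(1:end-2, 2:end-1, 2:end-1) + u(3:end, 2:end-1, 2:end-1) ...
        + u(2:end-1, 1:end-2, 2:end-1) + u(2:end-1, 3:end, 2:end-1) ...
        + u(2:end-1, 2:end-1, 1:end-2) + u(2:end-1, 2:end-1, 3:end) ...
        - 6 * u(2:end-1, 2:end-1, 2:end-1);
end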
Setup : 17 sec for 400x400x400 (20 sources)
N : 12.000 B
P : 8.000 B
U1 : 5.673 GB
U2 : 5.673 GB
U3 : 5.673 GB
a1i : 322.741 MB
a2 : 322.741 MB
a3 : 322.741 MB
adjoint_mode : 1.000 B
d : 4.000 B
i : 8.000 B
idx : 52.000 B
idxsrc : 7.594 KB
mode : 8.000 B
nt : 8.000 B
op : 645.495 MB
tsrc : 8.000 B
wsrc : 7.594 KB
x : 19.293 KB
y : 5.673 GB
==========================
T : 24.269 GB
• Compared to single-precision multi-RHS multiplication
Single time step
Matlab
• 40 sec per time step for 20 sources (20 runs)
Single time step
Chevron
• 0.10 sec per time step (20 threads)
• 2 sec per time step for 20 sources (needs to run 20 times)
• Stencil-based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
Chevron
• 2 sec per time step (1 thread)
• 2 sec per time step for 20 sources (can run 20 at once)
• Stencil-based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
Single-precision MEX Matlab
• ~4 sec per time step for 20 sources
• 0 GB of RAM for the matrices
• 1 GB of RAM per source
FWI performance scaling
Model size: 134 x 134 x 28
Number of shots: 30
Number of frequencies: 1
• 1 node * 8 processes: 1 hour, 0.5 GB
• 1 node * 16 processes: 0.53 hours, 1 GB
• 5 nodes * 16 processes: 0.15 hours, 1 GB
FWI performance scaling
Model size: 268 x 268 x 56
Number of shots: 30
Number of frequencies: 1
• 1 node * 8 processes: 12 hours, 4 GB
• 1 node * 16 processes: 6.3 hours, 8 GB
• 5 nodes * 16 processes: 1.3 hours, 8 GB
Scaling
• 6 km cube model
• ~40 wavelengths propagated between source & receiver
• 8 nodes
• Each node solves the PDEs in the sub-problems for 8 right-hand sides simultaneously.
• This setup can process 8 x 8 = 64 PDE solves simultaneously (see the sketch after this list).
• Fixed tolerance for all PDE solves.
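A minimal sketch of how such a distribution could be expressed with MATLAB's Parallel Computing Toolbox; the PDE solve is a placeholder handle and the worker-to-node mapping is assumed, not the actual SLIM framework code:

% Distribute 64 right-hand sides over the available workers; each
% worker solves the PDE for its block of sources at a fixed tolerance.
% parpool(8);                        % in the runs above: one worker per node
solvePDE = @(isrc) rand(100, 1);     % placeholder for one PDE solve

nRHS    = 64;
results = cell(nRHS, 1);
parfor isrc = 1:nRHS
    results{isrc} = solvePDE(isrc);  % 8 nodes x 8 RHS = 64 solves in flight
end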
8 Hz. Varying number of sources & receivers (8 - 256).
[Figure: time per source vs. number of sources (nsrc)]
• not enough sources & receivers to use the computational capacity of the nodes
• with more sources & receivers, the time per source stays close to constant
• other costs (including communication) increase, but remain relatively small
8 Hz. 64 sources & 64 receivers. Varying number of nodes (2 - 16).
[Figure: timing vs. number of nodes, broken down into Total, comp U, comp W, other]
• not enough sources & receivers to use the computational capacity of the nodes; results in smaller speedup
3D ocean-bottom cable/node data set generated on the BG 3D Compass model
- model size (nz x nx x ny): 164 x 601 x 601
- grid spacing: 6 m x 25 m x 25 m

Data dimensions (2501 x 500 x 500 x 85 x 85)
- number of time samples: 2501
- number of receivers in x & y direction: 500
- number of shots in x & y direction: 85
- sampling intervals: 0.004 s (time), 25 m (receiver), 150 m (shot)

Simulated with the Chevron 3D modeling code
Node partition: 128 GB
Number of nodes: 660
Simulation per 3D shot: 1.5 hours
Cumulative simulation time (85 x 85 shots): 27 hours
Memory storage of one shot record: 2.5 GB
Memory storage of all shot records: 18 TB
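The storage figures follow directly from the data dimensions, assuming 4-byte (single-precision) samples:

% Storage estimate for the OBC data set (4-byte samples assumed).
nt  = 2501;  nrx = 500;  nry = 500;            % samples per shot record
nsx = 85;    nsy = 85;                          % shot grid

one_shot_GB  = nt * nrx * nry * 4 / 1e9;        % ~2.5 GB per shot record
all_shots_TB = one_shot_GB * nsx * nsy / 1e3;   % ~18 TB for all 85 x 85 shots
fprintf('one shot: %.2f GB, all shots: %.1f TB\n', one_shot_GB, all_shots_TB);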
(nt x nrx x nry = 2500 x 500 x 500)
Simulation estimation
[Table: number of shots (X x Y) vs. disk space (TB)]
Performance scaling
Size of 3D survey: 2500 x 500 x 10 x 500 x 50
- number of time samples: 2500
- number of streamers: 10 (with 500 channels each)
- number of shots in x & y direction: 500 x 50
Number of workers | Number of SPGL1 iterations | Recovery time per seismic line (hrs) | Recovery time, all data (days)
20  | 200 | 78 | 162
50  | 200 | 31 | 64
100 | 200 | 16 | 33
500 | 200 | 3  | 6
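A hedged sketch of one sparse-recovery problem posed with the SPGL1 toolbox (assumed to be on the MATLAB path); a small random Gaussian matrix stands in for the actual measurement and sparsifying operators used in the recovery:

% Small basis-pursuit problem solved with SPGL1, capped at 200
% iterations as in the table above. A random matrix replaces the real
% measurement operator purely for illustration.
rng(3);
n = 512;  k = 20;  m = 200;
x0 = zeros(n, 1);  x0(randperm(n, k)) = randn(k, 1);   % sparse signal
A  = randn(m, n) / sqrt(m);
b  = A * x0;

opts = spgSetParms('iterations', 200, 'verbosity', 0);
xhat = spg_bp(A, b, opts);                             % basis-pursuit solve
fprintf('relative recovery error: %.2e\n', norm(xhat - x0) / norm(x0));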
Generated from the BG Compass model using time-stepping
Transformed into frequency slices, ~26 GB in size
85 x 85 sources at 150 m spacing, 500 x 500 receivers at 25 m spacing
90% of receiver pairs removed, on-grid sampling
Tensor interpolation
Parallelized over frequencies; implicit parallelism via Matlab's calls to LAPACK (see the sketch below)
20 iterations of Gauss-Newton Hierarchical Tucker interpolation
Each frequency slice takes 13-15 hours to interpolate, ~70-80 GB max memory
Run on the Yemoja cluster in Brazil "out of the box"
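A minimal sketch of the outer parallel loop over frequency slices; both helper handles below are placeholders, not the actual data loader or Gauss-Newton Hierarchical Tucker solver:

% Outer loop: interpolate each frequency slice on its own worker.
load_slice     = @(f) randn(100, 100, 'single');   % placeholder data loader
ht_interpolate = @(D, nIter) D;                    % placeholder HT solver

freqs   = 5:0.5:10;                                % example frequency band (Hz)
results = cell(numel(freqs), 1);
parfor k = 1:numel(freqs)
    D          = load_slice(freqs(k));
    results{k} = ht_interpolate(D, 20);            % 20 Gauss-Newton iterations
end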
HT interpolation - 90% missing receivers
Common source gather - 10 Hz
[Figures: common source gathers, receiver x vs. receiver y]
[Figure: SNR (dB)]
Acknowledgements
This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISE II (CRDPJ 375142-08).
Thank you for your attention