Scaling SINBAD software to 3-D on Yemoja
University of British Columbia SLIM
Curt Da Silva, Haneet Wason, Mathias Louboutin, Bas Peters, Shashin Sharan, Zhilong Fang
Scaling SINBAD software to 3-D on Yemoja
This talk
Showcase SLIM software as it applies to large(r)-scale problems on the Yemoja cluster
Performance scaling • as the number of parallel resources increases • comparisons to existing codes in C
Large data examples
A : time-domain forward modelling matrix
u : vectorized wavefield over all time steps and modelling grid points
q : source term
qk : source wavefield at time step k
Discrete form: A u = q (the continuous form is the wave equation, discretized in time and space)
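For orientation, a minimal MATLAB sketch (a toy 1-D constant-velocity example with made-up sizes, not the SLIM code) of the explicit time stepping that the matrix A encodes; each step applies the Laplacian stencil to the current wavefield and injects the source qk:

    % Toy explicit time stepping: u_{k+1} = 2*u_k - u_{k-1} + dt^2*(v.^2.*(L*u_k) + q_k)
    n  = 201; dx = 10; dt = 1e-3; nt = 1000;       % made-up grid and step sizes
    v  = 2000*ones(n,1);                           % velocity model (m/s)
    e  = ones(n,1);
    L  = spdiags([e -2*e e], -1:1, n, n) / dx^2;   % 3-point Laplacian stencil
    q  = zeros(n, nt);
    q(round(n/2), 1:50) = sin(2*pi*15*dt*(0:49));  % toy source wavelet at the centre
    u_prev = zeros(n,1); u_curr = zeros(n,1);
    for k = 1:nt
        u_next = 2*u_curr - u_prev + dt^2*(v.^2 .* (L*u_curr) + q(:,k));
        u_prev = u_curr; u_curr = u_next;          % march one time step forward
    end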
FWI Gradient
The FWI gradients have to pass the adjoint test: we only compute actions of J and JT, never the matrices themselves. To ensure they are true adjoints, the migration/demigration operators need to satisfy <J δm, δd> = <δm, JT δd>.
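As a concrete illustration, a small MATLAB dot-product test with a random matrix standing in for the Jacobian action (in the actual code J and JT are only available as operators, never as explicit matrices):

    % Dot-product (adjoint) test: <J*dm, dd> must match <dm, JT*dd> up to rounding.
    A     = randn(50, 80);                 % toy stand-in for the Jacobian action
    J_fwd = @(dm) A*dm;                    % demigration: model perturbation -> data
    J_adj = @(dd) A'*dd;                   % migration: data -> model perturbation
    dm = randn(80,1); dd = randn(50,1);
    rel_err = abs(dot(J_fwd(dm), dd) - dot(dm, J_adj(dd))) / abs(dot(J_fwd(dm), dd));
    fprintf('adjoint test relative error: %g\n', rel_err);   % ~1e-16 for a true adjoint pair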
Gradient Test
[Figure: gradient test, log-log plot of the Taylor errors versus step size h (10^-6 to 10^-1), showing the expected O(h) and O(h^2) decay.]
Zeroth-order Taylor error, O(h): || F(m0 + h·δm, q0 + h·δq) - F(m0, q0) ||
First-order Taylor error, O(h^2): || F(m0 + h·δm, q0 + h·δq) - F(m0, q0) - h·Jm δm - h·Jq δq ||
Ensure 2nd-order convergence of the Taylor expansion.
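A sketch of how such a gradient test can be run, assuming placeholder handles F (forward map) and Jvec (Jacobian action) and a starting model m0; only the model perturbation is shown here, the source perturbation is handled the same way:

    % Taylor/gradient test: the zeroth-order error should decay as O(h),
    % the first-order error (Jacobian term removed) as O(h^2).
    hs = 10.^(-6:-1);                      % step sizes h
    dm = randn(size(m0));                  % random model perturbation
    f0 = F(m0);
    e0 = zeros(size(hs)); e1 = zeros(size(hs));
    for i = 1:numel(hs)
        h     = hs(i);
        fh    = F(m0 + h*dm);
        e0(i) = norm(fh - f0);                    % ~ O(h)
        e1(i) = norm(fh - f0 - h*Jvec(m0, dm));   % ~ O(h^2)
    end
    loglog(hs, e0, 'o-', hs, e1, 's-');           % slopes 1 and 2 on a log-log plot
    legend('zeroth-order Taylor error O(h)', 'first-order Taylor error O(h^2)');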
[Figures: velocity model slices, X location (m) 0 - 5000 versus depth (m): good initial model and FWI result.]
Chevron modeling code: stencil based (no matrix)
Matlab basic: matrix based
Contains:
• forward time stepping and adjoint time stepping (true adjoint)
• the Jacobian and its adjoint (true adjoint)
• necessary for FWI, LSRTM, ...
Matlab basic
Time step
A1_inv : 1.891 GB
A2 : 16.968 GB
A3 : 1.891 GB
Ps : 1.703 KB
U1 : 645.481 MB
U2 : 645.481 MB
U3 : 645.481 MB
adjoint_mode : 1.000 B
mode : 8.000 B
nt : 8.000 B
op : 645.494 MB
x : 38.586 KB
y : 645.481 MB
==========================
T : 23.902 GB
Double precision => Single precision
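The memory gain is easy to verify in MATLAB; a toy snippet with an illustrative grid size:

    % Storing wavefields in single precision halves their memory footprint.
    n = 400^3;                                 % illustrative grid size
    u_double = zeros(n, 1);                    % 8 bytes per element (~512 MB)
    u_single = zeros(n, 1, 'single');          % 4 bytes per element (~256 MB)
    whos u_double u_single                     % compare the Bytes column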
Matlab advanced
Single precision + stencil based: • 20 times less memory than sparse matrices
Single precision: • wavefields two times less expensive memory-wise
C MatVec: • multi-threaded over RHS • no matrix, instead of ~20 GB • communication overhead between Matlab and C
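A rough sketch of the multi-RHS idea in plain MATLAB (toy sizes, placeholder stencil matrix): keeping one wavefield column per source lets a single matrix or stencil application advance all sources per time step, which is what the C MatVec threads over:

    % One wavefield column per source: a single matvec advances all sources at once
    % instead of looping over them.
    n = 50^3; nsrc = 20; dt = 1e-3;            % toy sizes
    L = -6*speye(n);                           % placeholder for a 3-D Laplacian stencil matrix
    U_curr = zeros(n, nsrc); U_prev = U_curr;  % one wavefield column per source
    Q_k    = zeros(n, nsrc);                   % source terms at the current time step
    U_next = 2*U_curr - U_prev + dt^2*(L*U_curr + Q_k);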
Setup: 17 sec for a 400 x 400 x 400 grid (20 sources)
N : 12.000 B
P : 8.000 B
U1 : 5.673 GB
U2 : 5.673 GB
U3 : 5.673 GB
a1i : 322.741 MB
a2 : 322.741 MB
a3 : 322.741 MB
adjoint_mode : 1.000 B
d : 4.000 B
i : 8.000 B
idx : 52.000 B
idxsrc : 7.594 KB
mode : 8.000 B
nt : 8.000 B
op : 645.495 MB
tsrc : 8.000 B
wsrc : 7.594 KB
x : 19.293 KB
y : 5.673 GB
==========================
T : 24.269 GB
• Compared to single-precision multi-RHS multiplication
Single time step
Matlab
• 40 sec per time step for 20 sources (20 runs)
Single time step
Chevron
• 0.10 sec per time step (20 threads)
• 2 sec per time step for 20 sources (needs to run 20 times)
• Stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
Chevron
• 2 sec per time step (1 thread)
• 2 sec per time step for 20 sources (can run 20 at once)
• Stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
Single precision MEX Matlab
• ~4 sec per time step for 20 sources
• 0 GB of RAM for the matrices
• 1 GB of RAM per source
FWI performance scaling
Model size: 134 x 134 x 28
Number of shots: 30
Number of frequencies: 1
Configuration             Runtime       Memory
1 node x 8 processes      1 hour        0.5 GB
1 node x 16 processes     0.53 hours    1 GB
5 nodes x 16 processes    0.15 hours    1 GB
FWI performance scaling
Model size: 268 x 268 x 56
Number of shots: 30
Number of frequencies: 1
Configuration             Runtime       Memory
1 node x 8 processes      12 hours      4 GB
1 node x 16 processes     6.3 hours     8 GB
5 nodes x 16 processes    1.3 hours     8 GB
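The shot-level parallelism behind these timings can be pictured as a sum over independent per-shot solves; a hedged MATLAB sketch in which misfit_and_gradient is a hypothetical per-shot helper (not the SLIM interface) and m is the current model vector:

    % Each shot is an independent forward + adjoint PDE solve, so shots distribute
    % naturally over the parallel pool; objective and gradient are summed back.
    nshots = 30;
    f = 0; g = zeros(numel(m), 1);
    parfor is = 1:nshots
        [fi, gi] = misfit_and_gradient(m, is);   % hypothetical per-shot misfit/gradient
        f = f + fi;                              % parfor reduction variables
        g = g + gi;
    end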
Scaling
• 6 km cube model
• ~40 wavelengths propagated between source & receiver
• 8 nodes
• Each node solves the PDEs in the sub-problems for 8 right-hand sides simultaneously.
• This setup can process 8 x 8 = 64 PDE solves simultaneously.
• Fixed tolerance for all PDE solves.
8 Hz. Varying number of sources & receivers (8 - 256).
[Figure: time per source (40 - 110) versus nsrc (0 - 300).]
Not enough sources & receivers to use the computational capacity of the nodes.
[Figure: time per source versus nsrc.]
More sources & receivers; still close to constant time per source.
[Figure: time per source versus nsrc.]
More sources & receivers; still close to constant time per source.
Other costs (including communication) increase, but remain relatively small.
8 Hz. 64 sources & 64 receivers. Varying number of nodes (2 - 16).
[Figure: timing breakdown versus number of nodes: total, comp U, comp W, other.]
Not enough sources & receivers to use the computational capacity of the nodes; this results in a smaller speedup.
3D ocean bottom cable/node data set generated on the BG 3D Compass model
- model size (nz x nx x ny): 164 x 601 x 601
- grid size: 6 m x 25 m x 25 m
Data dimensions: 2501 x 500 x 500 x 85 x 85
- number of time samples: 2501
- number of receivers in x & y direction: 500
- number of shots in x & y direction: 85
- sampling intervals: 0.004 s, 25 m (receiver), 150 m (shot)
Simulated with the Chevron 3D modeling code
Node partition: 128 GB
Number of nodes: 660
Simulation per 3D shot: 1.5 hours
Cumulative simulation time (85 x 85 shots): 27 hours
Memory storage of one shot record: 2.5 GB
Memory storage of all shot records: 18 TB
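The storage figures follow directly from the data dimensions; a quick back-of-the-envelope check assuming 4-byte samples:

    % Back-of-the-envelope check of the shot-record storage (4-byte samples assumed).
    nt = 2501; nrx = 500; nry = 500; nshots = 85*85;
    shot_GB  = nt*nrx*nry*4 / 1e9;             % ~2.5 GB per shot record
    total_TB = shot_GB*nshots / 1e3;           % ~18 TB for all 85 x 85 shots
    fprintf('per shot: %.2f GB, total: %.1f TB\n', shot_GB, total_TB);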
[Figure: one shot record, nt x nrx x nry = 2500 x 500 x 500.]
[Simulation estimate: disk space (TB) as a function of the number of shots (X x Y).]
Performance scaling
Size of 3D survey: 2500 x 500 x 10 x 500 x 50
- number of time samples: 2500
- number of streamers: 10 (with 500 channels each)
- number of shots in x & y direction: 500 x 50
Number of workers   Number of SPGL1 iterations   Recovery time per seismic line (hrs)   Recovery time, all data (days)
20                  200                          78                                     162
50                  200                          31                                     64
100                 200                          16                                     33
500                 200                          3                                      6
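For context, each seismic line maps onto a sparsity-promoting solve of this kind; a toy sketch using the public SPGL1 MATLAB interface (spg_bpdn / spgSetParms), with a random matrix standing in for the actual sampling and transform operators:

    % Toy basis-pursuit denoise solve with SPGL1 (stand-in for one seismic line;
    % the real problem uses the survey's sampling + transform operators).
    m = 120; n = 512; k = 20;
    A  = randn(m, n);                          % placeholder measurement operator
    x0 = zeros(n, 1); x0(randperm(n, k)) = randn(k, 1);
    b  = A*x0;                                 % observed (subsampled) data
    opts = spgSetParms('iterations', 200, 'verbosity', 0);
    x = spg_bpdn(A, b, 1e-3*norm(b), opts);    % min ||x||_1  s.t.  ||A*x - b||_2 <= sigma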
Generated from the BG Compass model using time-stepping
Transformed into frequency slices, ~26 GB in size
85 x 85 sources at 150 m spacing, 500 x 500 receivers at 25 m spacing
90% of receiver pairs removed, on-grid sampling
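A small sketch of the subsampling and of the SNR measure used to score the recovery further below; D_full is a placeholder for one full (receiver x, receiver y, source x, source y) frequency slice:

    % Remove 90% of the receiver (x,y) positions from a frequency slice and
    % define the recovery SNR in dB; D_full is a placeholder 4-D array.
    p_keep = 0.10;
    mask   = double(rand(500, 500) <= p_keep);       % keep ~10% of receiver pairs, on-grid
    D_obs  = bsxfun(@times, D_full, mask);           % zero out the missing receivers
    snr_dB = @(D_rec) -20*log10(norm(D_rec(:) - D_full(:)) / norm(D_full(:)));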
Tensor interpolation
Parallelized over frequencies, with implicit parallelism via Matlab's calls to LAPACK (see the sketch below)
20 iterations, Gauss-Newton Hierarchical Tucker interpolation
Each frequency slice takes 13-15 hours to interpolate, ~70-80 GB max memory
Run on the Yemoja cluster in Brazil “out of the box”
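A hedged sketch of that frequency-slice parallelism; load_slice, interpolate_HT and save_slice are hypothetical helpers, not the actual SLIM functions:

    % Each frequency slice is interpolated independently, so slices map directly
    % onto parallel workers / cluster nodes.
    nfreq = numel(freqs);                            % freqs: vector of frequencies (assumed given)
    parfor f = 1:nfreq
        Df   = load_slice(data_dir, freqs(f));       % hypothetical I/O for one ~26 GB slice
        Dint = interpolate_HT(Df, mask, 20);         % hypothetical: 20 Gauss-Newton HT iterations
        save_slice(out_dir, freqs(f), Dint);
    end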
HT Interpolation - 90% missing receivers. Common source gather - 10 Hz.
[Figures: common source gathers, receiver x versus receiver y (1 - 500).]
[Figure: recovery SNR (dB).]
Acknowledgements
This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISEII (CRDPJ 375142-08).
Thank you for your attention