
SLIM, University of British Columbia

Curt Da Silva, Haneet Wason, Mathias Louboutin, Bas Peters, Shashin Sharan, Zhilong Fang

Scaling SINBAD software to 3-D on Yemoja

Wednesday, 28 October 2015

Released to public domain under Creative Commons license type BY (https://creativecommons.org/licenses/by/4.0). Copyright (c) 2018 SINBAD consortium - SLIM group @ The University of British Columbia.

This talk

Showcase SLIM software as it applies to large(r)-scale problems on the Yemoja cluster

Performance scaling:
• as the number of parallel resources increases
• comparisons to existing codes in C

Large data examples

FWI - Time Domain
Mathias Louboutin

Acoustic wave equation in time domain

Continuous form:

$\frac{1}{v^2}\frac{\partial^2 u}{\partial t^2} - \nabla^2 u = q$

Linear algebra form:

$Au = q$

A : time-domain forward modelling matrix
u : vectorized wavefield of all time steps and modelling grid points
q : source

Usual forward modelling

Fully discretized wave equation:

$A_1 u^{k+1} + A_2 u^k + A_3 u^{k-1} = q^{k-1}$

with:

$A_1 = \mathrm{diag}\left(\frac{m}{\Delta t^2}\right), \quad A_2 = -L - 2\,\mathrm{diag}\left(\frac{m}{\Delta t^2}\right), \quad A_3 = \mathrm{diag}\left(\frac{m}{\Delta t^2}\right)$

$q^k$ : source wavefield at time step k
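As an illustration, a minimal 1-D MATLAB sketch of this recursion (sizes, names and the 3-point Laplacian are illustrative, not the SLIM production code); because A1 is diagonal, each time step is an explicit update:

% Minimal 1-D time-stepping sketch of the recursion above.
n  = 201; dx = 10; dt = 1e-3; nt = 500;
v  = 2000*ones(n,1);                          % velocity [m/s]
m  = 1./v.^2;                                 % slowness squared
e  = ones(n,1);
L  = spdiags([e -2*e e], -1:1, n, n)/dx^2;    % discrete Laplacian
d  = m/dt^2;                                  % diagonal of A1 (and A3)
A2 = -L - 2*spdiags(d, 0, n, n);
q  = zeros(n, nt); q(101, 1) = 1;             % impulsive source at the center
u0 = zeros(n,1); u1 = zeros(n,1);
for k = 2:nt-1
    % A1*u^{k+1} = q^{k-1} - A2*u^k - A3*u^{k-1}; A1 is diagonal, so divide
    u2 = (q(:,k-1) - A2*u1 - d.*u0) ./ d;
    u0 = u1; u1 = u2;
end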

FWI Gradient

The FWI gradients have to pass the adjoint test:
‣ we only compute actions of $J, J^T$, never the matrices themselves
‣ to ensure they are true adjoints, the migration/demigration operators need to satisfy

$\|\delta d^T J\,\delta m - \delta m^T J^T \delta d\| < \epsilon$
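A self-contained sketch of this dot-product test, with an explicit random matrix standing in for J (in the actual code only the actions of J and J' are available, matrix-free):

% Adjoint (dot-product) test with an explicit random J for illustration.
nmodel = 100; ndata = 80;
J   = randn(ndata, nmodel);
dm  = randn(nmodel, 1);        % random model perturbation
dd  = randn(ndata, 1);         % random data perturbation
lhs = dd' * (J * dm);          % <dd, J*dm>
rhs = dm' * (J' * dd);         % <J'*dd, dm>
fprintf('relative adjoint error: %g\n', abs(lhs - rhs)/abs(lhs));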

Gradient Test

Ensure 2nd-order convergence of the Taylor expansion:

zeroth-order Taylor error, $O(h)$:
$\|F(m_0 + h \cdot \delta m,\ \epsilon_0 + h \cdot \delta\epsilon) - F(m_0, \epsilon_0)\|$

first-order Taylor error, $O(h^2)$:
$\|F(m_0 + h \cdot \delta m,\ \epsilon_0 + h \cdot \delta\epsilon) - F(m_0, \epsilon_0) - h \cdot J_m \delta m - h \cdot J_\epsilon \delta\epsilon\|$

[Figure: multiparameter gradient test; both Taylor errors vs. step size h on log-log axes, h from 1e-6 to 1e-1]
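A hedged sketch of such a Taylor test on a toy scalar objective (in practice F is the forward modelling operator and gradF the FWI gradient being verified):

% Taylor-error gradient test on a toy quadratic.
F     = @(x) 0.5*norm(x)^2;
gradF = @(x) x;
m0 = randn(50,1); dm = randn(50,1);
h  = 10.^(-6:-1);
e0 = zeros(size(h)); e1 = zeros(size(h));
for i = 1:numel(h)
    e0(i) = abs(F(m0 + h(i)*dm) - F(m0));                        % O(h)
    e1(i) = abs(F(m0 + h(i)*dm) - F(m0) - h(i)*gradF(m0)'*dm);   % O(h^2)
end
loglog(h, e0, 'o-', h, e1, 's-');
legend('zeroth-order error, O(h)', 'first-order error, O(h^2)');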

True velocity

[Figure: true velocity model; x location 0-5000 m, depth 0-2000 m, colour scale 1.5-4.5 (km/s)]

Good initial model

[Figure: initial velocity model; same axes and colour scale]

FWI

[Figure: FWI result; same axes and colour scale]

Implementation


Chevron Modeling code

Modeling only

SSE and AVX enabled

10th order in space, 4th order in time

Stencil based (no matrix)

Matlab basic

Matrix based

No fancy speed-ups yet

Contains:
• forward time stepping and adjoint time stepping (true adjoint)
• Jacobian and its adjoint (true adjoint)
• everything necessary for FWI, LSRTM, ...

Matlab basic

Setup: 300 sec for a 400 x 400 x 400 grid (1 source)

Time step:

A1_inv : 1.891 GB
A2 : 16.968 GB
A3 : 1.891 GB
Ps : 1.703 KB
U1 : 645.481 MB
U2 : 645.481 MB
U3 : 645.481 MB
adjoint_mode : 1.000 B
mode : 8.000 B
nt : 8.000 B
op : 645.494 MB
x : 38.586 KB
y : 645.481 MB
==========================
T : 23.902 GB

Matlab advanced

No matrix => stencil based + 3 vectors

Double precision => single precision

Matlab MatVec => C MatVec with multiple RHS

Matlab advanced

Single precision + stencil based:
• 20 times less memory than sparse matrices

Single precision:
• wavefields are two times less expensive memory-wise

C MatVec:
• multi-threaded over RHS
• no matrix storage (instead of ~20 GB)
• communication overhead between Matlab and C
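A plain-MATLAB sketch of the stencil-based, multi-RHS idea (the actual MatVec is a multi-threaded C MEX file; the 7-point Laplacian here is illustrative; save as stencil_matvec.m):

% Stencil-based Laplacian applied to a block of wavefields at once, with no
% explicit matrix; the C MEX version threads over the RHS loop.
function Y = stencil_matvec(X, n, dx)
% X : prod(n)-by-nrhs block of vectorized wavefields (e.g. single precision)
Y = zeros(size(X), 'like', X);
for r = 1:size(X, 2)
    u = reshape(X(:, r), n);
    w = -6*u;
    w(2:end-1,:,:) = w(2:end-1,:,:) + u(1:end-2,:,:) + u(3:end,:,:);
    w(:,2:end-1,:) = w(:,2:end-1,:) + u(:,1:end-2,:) + u(:,3:end,:);
    w(:,:,2:end-1) = w(:,:,2:end-1) + u(:,:,1:end-2) + u(:,:,3:end);
    Y(:, r) = w(:) / dx^2;
end
end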

Setup: 17 sec for a 400 x 400 x 400 grid (20 sources)

Memory usage:

N : 12.000 B
P : 8.000 B
U1 : 5.673 GB
U2 : 5.673 GB
U3 : 5.673 GB
a1i : 322.741 MB
a2 : 322.741 MB
a3 : 322.741 MB
adjoint_mode : 1.000 B
d : 4.000 B
i : 8.000 B
idx : 52.000 B
idxsrc : 7.594 KB
mode : 8.000 B
nt : 8.000 B
op : 645.495 MB
tsrc : 8.000 B
wsrc : 7.594 KB
x : 19.293 KB
y : 5.673 GB
==========================
T : 24.269 GB


Timings and memory

How does Matlab scale?

• compared with the Chevron modelling code

• compared to single-precision multi-RHS multiplication

Single time step

For a given 561 x 561 x 194 cube

Matlab:

• 100 GB of RAM
• 2 sec per time step per source
• 40 sec per time step for 20 sources (20 runs)

Single time step

For a given 561 x 561 x 194 cube

Chevron:

• 0.10 sec per time step (20 threads)
• 2 sec per time step for 20 sources (needs to run 20 times)
• stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)

Single time step

For a given 561 x 561 x 194 cube

Chevron:

• 2 sec per time step (1 thread)
• 2 sec per time step for 20 sources (can run 20 at once)
• stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)

Single time step

For a given 561 x 561 x 194 cube

Single-precision MEX Matlab:

• ~0.2 sec per time step per source
• ~4 sec per time step for 20 sources
• 0 GB of RAM for the matrices
• 1 GB of RAM per source

Full Waveform Inversion - Time Harmonic
Zhilong Fang

FWI performance scaling

Model size: 134 x 134 x 28
Number of shots: 30
Number of frequencies: 1

Setup                   | Time per iteration | Memory use per node
1 node x 8 processes    | 1 hour             | 0.5 GB
1 node x 16 processes   | 0.53 hours         | 1 GB
5 nodes x 16 processes  | 0.15 hours         | 1 GB

FWI performance scaling

Model size: 268 x 268 x 56
Number of shots: 30
Number of frequencies: 1

Setup                   | Time per iteration | Memory use per node
1 node x 8 processes    | 12 hours           | 4 GB
1 node x 16 processes   | 6.3 hours          | 8 GB
5 nodes x 16 processes  | 1.3 hours          | 8 GB
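The scaling in these tables comes from distributing independent shots over parallel workers. A minimal sketch of that pattern, assuming a hypothetical per-shot routine shot_misfit and variables nshots, nmodel and model vector m in scope:

% Shots are independent, so misfit and gradient sum over shots in a parfor;
% shot_misfit is hypothetical, standing in for the per-shot PDE solve.
parpool(16);                        % e.g. 1 node * 16 processes
f = zeros(nshots, 1);
g = zeros(nmodel, 1);
parfor s = 1:nshots
    [fs, gs] = shot_misfit(m, s);   % data misfit and gradient for shot s
    f(s) = fs;
    g = g + gs;                     % parfor reduction over shots
end
fval = sum(f);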

3D WRI
Bas Peters

Scaling

• 6 km cube model
• ~40 wavelengths propagated between source & receiver
• 8 nodes
• each node solves the PDEs in the sub-problems for 8 right-hand sides simultaneously
• this setup can process 8 x 8 = 64 PDE solves simultaneously (see the sketch below)
• fixed tolerance for all PDE solves
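A hedged sketch of that 8 x 8 blocking using MATLAB's SPMD model; helmholtz_solve is a hypothetical block solver, and H (discretized Helmholtz operator), Q (matrix of source columns) and tol are assumed to be in scope:

% Distribute the source columns of Q over 8 workers; each worker passes its
% local block of 8 right-hand sides to a multi-RHS solve at a fixed tolerance.
spmd(8)
    Qd   = codistributed(Q, codistributor1d(2));       % split sources over workers
    Uloc = helmholtz_solve(H, getLocalPart(Qd), tol);  % hypothetical block solver
end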

8 Hz. Varying number of sources & receivers (8 - 256).

[Figure: left panel, total time and its breakdown (Total, comp U, comp W, other) vs. nsrc on log-log axes; right panel, time per source vs. nsrc, roughly 40-110 s; 8 nodes, 8 Hz]

[Figure repeated] Not enough sources & receivers to use the computational capacity of the nodes.

[Figure repeated] More sources & receivers: still close to constant time per source.

[Figure repeated] More sources & receivers: still close to constant time per source. Other costs (including communication) increase, but remain relatively small.

8 Hz. 64 sources & 64 receivers. Varying number of nodes (2 - 16).

[Figure: time (Total, comp U, comp W, other) vs. number of nodes on log-log axes; 64 sources & receivers, 8 Hz]

[Figure repeated] Not enough sources & receivers to use the computational capacity of the nodes; this results in a smaller speedup.

3D data simulation
Haneet Wason & Shashin Sharan

Simulation parameters

3D ocean-bottom cable/node data set generated on the BG 3D Compass model:
- model size (nz x nx x ny): 164 x 601 x 601
- grid size: 6 m x 25 m x 25 m

Data dimensions (2501 x 500 x 500 x 85 x 85):
- number of time samples: 2501
- number of receivers in x & y direction: 500
- number of shots in x & y direction: 85
- sampling intervals: 0.004 s, 25 m (receiver), 150 m (shot)

Simulated with the Chevron 3D modeling code
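A quick check of the resulting data volume (single precision, 4 bytes per sample), matching the storage figures quoted later:

% Size of one 3D shot record and of the full survey, in decimal units.
nt = 2501; nrx = 500; nry = 500; nshots = 85*85;
shot_bytes  = nt*nrx*nry*4;           % ~2.5 GB per shot record
total_bytes = shot_bytes*nshots;      % ~18 TB for all 85 x 85 shots
fprintf('per shot %.1f GB, total %.1f TB\n', shot_bytes/1e9, total_bytes/1e12);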

BG 3D Compass model

[Figure: x-direction and y-direction vertical slices; lateral 0-15000 m, depth 0-2000 m]

Source-receiver layout

[Figure: plan view; receivers every 25 m in x and y, sources every 150 m in x and y]

Computational resources used

Time & memory usage:

Node partition: 128 GB
Number of nodes: 660

Simulation per 3D shot: 1.5 hours
Cumulative simulation time (85 x 85 shots): 27 hours
Memory storage of one shot record: 2.5 GB
Memory storage of all shot records: 18 TB

Running jobs & activated nodes (SENAI Yemoja cluster)

3D shot records

[Figure: a 3D shot record over the Rx and Ry directions; nt x nrx x nry = 2500 x 500 x 500]

Simulation estimation

X & Y receiver spacing (m) | X & Y shot spacing (m) | Number of shots (X x Y) | Disk space (TB)
25                         | 25                     | 500 x 500               | 610
25                         | 50                     | 250 x 250               | 153
25                         | 75                     | 165 x 165               | 67
25                         | 100                    | 125 x 125               | 38.5
25                         | 125                    | 100 x 100               | 25
25                         | 150                    | 85 x 85                 | 18

Simultaneous acquisition
Haneet Wason & Shashin Sharan

Performance scaling

Size of 3D survey: 2500 x 500 x 10 x 500 x 50
- number of time samples: 2500
- number of streamers: 10 (with 500 channels each)
- number of shots in x & y direction: 500 x 50

Number of workers | Number of SPGL1 iterations | Recovery time per seismic line (hrs) | Recovery time, all data (days)
20                | 200                        | 78                                   | 162
50                | 200                        | 31                                   | 64
100               | 200                        | 16                                   | 33
500               | 200                        | 3                                    | 6
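A hedged sketch of the per-line recovery step with SPGL1, the open-source basis-pursuit solver whose iterations are counted in the table above; the operator A and data b below are placeholders for the actual acquisition-plus-transform operator and blended observations:

% Basis-pursuit denoise recovery of one seismic line with SPGL1.
% A : hypothetical SPOT-style operator handle, A(x,1) forward, A(x,2) adjoint.
opts  = spgSetParms('iterations', 200, 'verbosity', 1);
sigma = 1e-3*norm(b);                        % misfit tolerance
[xrec, r, g, info] = spgl1(A, b, [], sigma, [], opts);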

Interpolation - Tensor Completion
Curt Da Silva

Hierarchical Tucker format

X : an n1 x n2 x n3 x n4 tensor

[Figure, progressive build: the matricization U12, of size n1 n2 x k12, is reshaped into an n1 x n2 x k12 tensor, which is further factored into U1 (n1 x k1), U2 (n2 x k2) and a transfer tensor B12]
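A toy sketch of this recursive low-rank structure, using explicit SVDs on a small random tensor (the actual interpolation optimizes the HT parameters directly with Gauss-Newton and never forms the full tensor):

% Matricize a 4D tensor along the (1,2)/(3,4) split and factor it; reshaping
% the left factor exposes the next level of the HT dimension tree.
n = [10 10 10 10]; k12 = 5;
X = randn(n);
M = reshape(X, n(1)*n(2), n(3)*n(4));   % (1,2) matricization
[U12, S12, V12] = svds(M, k12);         % U12 is n1*n2 x k12
T = reshape(U12, n(1), n(2), k12);      % n1 x n2 x k12 tensor
% factoring T along mode 1 and mode 2 yields U1 (n1 x k1), U2 (n2 x k2) and
% a small k1 x k2 x k12 transfer tensor B12, as in the diagram above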

Data set

Generated from the BG Compass model using time stepping

Transformed into frequency slices, ~26 GB in size

85 x 85 sources at 150 m spacing, 500 x 500 receivers at 25 m spacing

90% of receiver pairs removed, on-grid sampling

Tensor interpolation

Parallelized over frequencies, with implicit parallelism via Matlab's calls to LAPACK

20 iterations of Gauss-Newton Hierarchical Tucker interpolation

Each frequency slice takes 13-15 hours to interpolate, with ~70-80 GB maximum memory

Run on the Yemoja cluster in Brazil "out of the box"

HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data (left) vs. subsampled input data (right); receiver x vs. receiver y, 1-500]

HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data (left) vs. interpolated data (right), SNR 19.3 dB]

HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data (left) vs. difference (right)]

SNR vs Frequency

[Figure: Train SNR and Test SNR, roughly 18-30 dB, versus frequency in Hz]
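For reference, the SNR figures here follow the usual definition below (the exact convention in the original code is assumed):

% SNR in dB of recovered data xrec against true data xtrue.
snr_db = @(xtrue, xrec) 20*log10(norm(xtrue(:)) / norm(xtrue(:) - xrec(:)));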

Acknowledgements

This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISEII (CRDPJ 375142-08).

Thank you for your attention