TRANSCRIPT
S4599: An Adventure in Porting: Adding GPU-Acceleration to Open-Source 3D Elastic Wave Modeling
Robin M. Weiss and Jeffery Shragge
March 26, 2014
GPU Technology Conference
Seismic Data Modeling
• Elastic modeling
– Anisotropic and heterogeneous materials
• Seismic imaging with elastic models
– Increasing interest in this area
– Solved as an inverse problem
– Requires accurate, fast forward solvers
• Computational issues
– Large model sizes
– Generally requires substantial compute power
Open-Source 3D Elastic Modeling
• Open-source computational geophysics package
– Madagascar – www.reproducibility.org (RSFSRC/user/rweiss)
– Targets reproducibility, transparency, and clarity
• Madagascar's CPU-based (w/ OpenMP) elastic wave equation (EWE) solver implementation:
– Paul Sava, Colorado School of Mines
– Anisotropic elastic wave equation
– Stress-stiffness formulation (written out just below)
– 3D finite-difference solver, centered derivatives, regular grid
• 8th order spatial, 2nd order time
Weiss and Shragge, 2013, Solving 3D anisotropic elastic wave equations on parallel GPU devices, Geophysics
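For context (not shown on the slide), the stress-stiffness formulation referred to here is, in standard index notation with implied summation, approximately:

\rho\,\partial_t^2 u_i = \partial_j \sigma_{ij} + f_i, \qquad
\sigma_{ij} = c_{ijkl}\,\varepsilon_{kl}, \qquad
\varepsilon_{kl} = \tfrac{1}{2}\left(\partial_k u_l + \partial_l u_k\right)

where u is displacement, c_ijkl the stiffness tensor, rho the density, and f the source term; this matches the displacement -> strain -> stress -> acceleration kernel sequence shown in the flowchart later in the talk.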
Finite-Difference and GPUs
• Represents the bulk of the compute work
• Compact FD stencils map nicely to GPUs (a rough sketch of such a kernel follows below)
– Regular memory access pattern
– Some data reuse (use shared memory)
– But still memory bound
• GTC 2013 – S3176
Micikevicius, 2009, 3D Finite Difference Computation on GPUs using CUDA, NVIDIA
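The Madagascar source (user/rweiss) is the authoritative version; as a rough, hypothetical illustration of the pattern described above, an 8th-order centered derivative kernel that stages a tile plus halo columns in shared memory might look like the sketch below. The kernel name, block dimensions, and coefficient array are invented for the sketch, which assumes nx and ny are multiples of the block dimensions.

// Hypothetical sketch, not the Madagascar kernel: 8th-order centered first
// derivative along x on a regular grid, staging a tile plus halo columns in
// shared memory for data reuse. Assumes nx % BDIMX == 0 and ny % BDIMY == 0.
#define RADIUS 4
#define BDIMX  32
#define BDIMY  8

__constant__ float c[RADIUS + 1];   // FD coefficients (center weight c[0] is zero here)

__global__ void ddx8(const float *u, float *du, int nx, int ny, int nz, float invdx)
{
    __shared__ float tile[BDIMY][BDIMX + 2 * RADIUS];

    int ix = blockIdx.x * BDIMX + threadIdx.x;
    int iy = blockIdx.y * BDIMY + threadIdx.y;
    int tx = threadIdx.x + RADIUS;

    for (int iz = 0; iz < nz; iz++) {
        int idx = (iz * ny + iy) * nx + ix;

        // stage the interior point and, via the first RADIUS threads, the halo columns
        tile[threadIdx.y][tx] = u[idx];
        if (threadIdx.x < RADIUS) {
            tile[threadIdx.y][tx - RADIUS] = (ix >= RADIUS)    ? u[idx - RADIUS] : 0.f;
            tile[threadIdx.y][tx + BDIMX]  = (ix + BDIMX < nx) ? u[idx + BDIMX]  : 0.f;
        }
        __syncthreads();

        // centered 8th-order first derivative (antisymmetric weights, no center term)
        float d = 0.f;
        for (int r = 1; r <= RADIUS; r++)
            d += c[r] * (tile[threadIdx.y][tx + r] - tile[threadIdx.y][tx - r]);
        du[idx] = d * invdx;
        __syncthreads();
    }
}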
[Flowchart: solver workflow. The CPU reads the source wavelet, stiffness, and density fields, then initializes GPU data structures and copies the model data to the GPU. Each timestep the GPU runs, in order: (1) Displacement to Strain, (2) Strain to Stress, (3) Near-Surface BC (if a free surface is present), (4) Inject Stress Source (if a stress source is used), (5) Stress to Acceleration, (6) Inject Accel Source (if an acceleration source is used), (7) Advance Time, (8) Boundary Conditions, (9) Extract Receiver Data. When the maximum number of timesteps is completed, the data is written to disk and the run is done.]
Optimize: The usual suspects…
• Stash data in shared memory and/or registers
• Find good block dimensions
• Replace repeated expressions
• Kernel fusing (see the sketch below)
– Keep an eye on register usage (-Xptxas -v)
• CUDA Visual Profiler is your friend
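As a hedged, schematic illustration of the kernel-fusing idea (function and variable names are invented, and the stencils are reduced to 2nd order in 1D for brevity), the point is simply that intermediate strain values stay in registers instead of making a round trip through global memory between two separate kernels:

// Hypothetical, schematic example of kernel fusing: strain is computed and
// immediately consumed in registers rather than being written to a global
// scratch array by one kernel and re-read by the next.
// Fusing raises register pressure; check it with: nvcc -Xptxas -v
__global__ void strain_stress_fused(const float *ux, const float *uz,
                                    const float *stiff, float *stress,
                                    int n, float invdx, float invdz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= n - 1) return;          // skip the edges of this 1D toy grid

    // intermediate strain components live only in registers
    float exx = 0.5f * invdx * (ux[i + 1] - ux[i - 1]);
    float ezz = 0.5f * invdz * (uz[i + 1] - uz[i - 1]);

    stress[i] = stiff[i] * (exx + ezz);       // schematic constitutive relation
}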
GPU-CPU Displacement Comparison
[Figure: depth component of the displacement field overlaid on the velocity model, computed with the GPU and CPU codes]
Off to the races…
• All experiments run on N³ data sets for 2000 timesteps
• CPU results computed with Intel Sandy Bridge E5-2670 2.60 GHz
– 2x 8-core, IBM iDataPlex dx360 M4
– University of Chicago's Midway Compute Cluster
• GPU results computed with Nvidia K40
– 4x GPUs per node
– Nvidia PSG cluster – thanks, Nvidia!
Off to the races…
[Chart: runtime (seconds, log scale) vs. cube size for the GPU and 16-CPU implementations]
Off to the races…
Per-kernel throughput (Mpoints/sec):
Cube Size   Disp. to Strain   Strain to Stress   Stress to Accel.   Advance Disp.
100         1405              1657               1064               2976
148         1333              1702               1053               3077
196         1467              1714               1164               3051
244         1449              1662               1054               3051
292         1404              1703               993                2950
340         1282              1721               1021               3170
Overall Throughput
[Chart: overall GPU throughput (Mpoints/sec) vs. cube size]
Speedup
[Chart: speedup factor vs. cube size (100 to 388), GPU vs. 16-CPU]
Speedup
[Chart: speedup factor vs. cube size (100 to 388), GPU vs. 16-CPU and GPU vs. 1-CPU]
While this is all well and good…
[Chart: runtime (seconds, log scale) vs. cube size up to 1024 for the GPU and 16-CPU implementations]
…CPU implementation has us beat
[Chart repeated from the previous slide: runtime (seconds, log scale) vs. cube size up to 1024 for GPU and 16-CPU]
GPU Memory Wall
• Limited GPU memory becomes an issue
• In our 3D implementation, each grid point requires:
– 1 density value
– 9 stiffness coefficients
– 3 displacement components (x3 time steps)
– 6 strain/stress field components
– 3 acceleration components
• GlobalMemory ≥ 28 * gridPoints * sizeof(float)
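To make the bound concrete, here is a small host-side helper (a sketch using the 28-floats-per-point count above; the 512³ example and the 12 GB figure for a single K40 are our own additions):

#include <stdio.h>

/* Per-point footprint from the breakdown above:
   1 density + 9 stiffness + 3x3 displacement + 6 strain/stress + 3 acceleration
   = 28 single-precision values per grid point. */
static size_t grid_bytes(size_t nx, size_t ny, size_t nz)
{
    return 28 * nx * ny * nz * sizeof(float);
}

int main(void)
{
    /* A 512^3 cube needs roughly 28 * 512^3 * 4 B, about 15 GB, which is
       already more than the 12 GB of device memory on a single K40. */
    printf("512^3 grid: %.1f GB\n", grid_bytes(512, 512, 512) / 1e9);
    return 0;
}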
Multiple Devices
• Increase GPU memory (in aggregate) by distributing the grid across multiple devices
• Not quite embarrassingly parallel
– FD stencil needs halo cells from neighbor device(s)
– Halo dance happens each iteration
– Communication can become costly
• Slice on the slowest-varying axis (see the sketch below)
– Easy to compute stride
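As a rough, hypothetical sketch of what slicing on the slowest-varying axis buys you: each device owns a contiguous run of grid planes plus ghost planes on its interior faces, so the per-device bookkeeping reduces to a handful of integers (struct and function names are invented here):

/* Hypothetical decomposition bookkeeping: split the nz planes of an nx*ny*nz
   grid across ndev devices along the slowest-varying axis. Each "plane" is a
   contiguous block of nx*ny floats, so offsets and strides stay simple. */
#define RADIUS 4            /* half-width of the 8th-order stencil */

typedef struct {
    int z0, nzLocal;        /* first owned plane and number of owned planes  */
    int haloLo, haloHi;     /* RADIUS ghost planes on interior faces, else 0 */
} Slab;

static Slab make_slab(int dev, int ndev, int nz)
{
    Slab s;
    int base = nz / ndev, rem = nz % ndev;
    s.nzLocal = base + (dev < rem ? 1 : 0);
    s.z0      = dev * base + (dev < rem ? dev : rem);
    s.haloLo  = (dev > 0)        ? RADIUS : 0;
    s.haloHi  = (dev < ndev - 1) ? RADIUS : 0;
    return s;
}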
Multiple Devices
[Diagram: the model grid partitioned into four slabs, one per device, GPU 0 through GPU 3]
Multiple Devices: Take 1
• One host process per device
• Communicate through MPI
[Diagram: Node 0 running Rank 0 and Rank 1, each attached to one GPU over PCIe and exchanging data via MPI]
Multiple Devices: Take 1
• Generalizes nicely, scales to multiple nodes
• Handles an arbitrary* number of devices & nodes
[Diagram: the same two-rank, two-GPU layout replicated on Node 0 through Node N, with MPI connecting ranks within and across nodes]
Multiple Devices: Take 1
• But: the data dance is awkward (read: slow)
[Diagram: on Node 0, Rank 0 stages gpu0_buf into send_buf with cudaMemcpy, exchanges it with Rank 1 via MPI_Send / MPI_Recv, and Rank 1 copies recv_buf into gpu1_buf with another cudaMemcpy]
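A hedged sketch of that per-iteration dance from one rank's point of view (function and buffer names are invented; MPI_Sendrecv stands in here for the MPI_Send / MPI_Recv pair on the slide to avoid deadlock):

#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical Take 1 halo exchange for one field on one MPI rank.
   haloBytes = RADIUS * nx * ny * sizeof(float); sendOff/recvOff are the element
   offsets of the boundary planes and ghost planes within the device array. */
void exchange_halo_take1(float *d_field, float *send_buf, float *recv_buf,
                         size_t haloBytes, int neighbor,
                         size_t sendOff, size_t recvOff, MPI_Comm comm)
{
    MPI_Status st;

    /* 1) device -> host staging buffer */
    cudaMemcpy(send_buf, d_field + sendOff, haloBytes, cudaMemcpyDeviceToHost);

    /* 2) swap halos with the neighboring rank over MPI */
    MPI_Sendrecv(send_buf, (int)haloBytes, MPI_BYTE, neighbor, 0,
                 recv_buf, (int)haloBytes, MPI_BYTE, neighbor, 0, comm, &st);

    /* 3) host -> device, into this rank's ghost planes */
    cudaMemcpy(d_field + recvOff, recv_buf, haloBytes, cudaMemcpyHostToDevice);
}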
Multiple Devices: Take 1
[Chart: runtime (seconds, log scale) vs. cube size for 1, 2, 4, and 8 GPUs under the MPI scheme (MPI-1, MPI-2, MPI-4, MPI-8)]
Multiple Devices: Take 1
• Can we fix this?
– CUDA-aware MPI…?
– GPU Direct…?
– RDMA…?
– Black magic…?
• Open-source research code considerations
– Keep it simple!
Multiple Devices: Take 2
• One host process per node
• Communicate through peer-to-peer memcpy (see the sketch below)
[Diagram: a single host process on Node 0 driving GPU 0 through GPU 3, with halos exchanged directly between devices via P2P]
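A minimal, hedged sketch of the peer-to-peer approach (helper names are ours; it assumes the devices can reach each other over PCIe, which is what cudaDeviceCanAccessPeer reports):

#include <cuda_runtime.h>

/* Hypothetical Take 2 helpers: one host process drives every GPU on the node. */

/* Enable peer access once at startup between all device pairs that support it. */
void enable_p2p(int ndev)
{
    for (int a = 0; a < ndev; a++) {
        cudaSetDevice(a);
        for (int b = 0; b < ndev; b++) {
            int ok = 0;
            if (a != b && cudaDeviceCanAccessPeer(&ok, a, b) == cudaSuccess && ok)
                cudaDeviceEnablePeerAccess(b, 0);
        }
    }
}

/* Each iteration, push this device's boundary planes straight into the
   neighbor's ghost planes: no host staging buffers and no on-node MPI. */
void send_halo_p2p(float *dst, int dstDev, const float *src, int srcDev, size_t haloBytes)
{
    cudaMemcpyPeer(dst, dstDev, src, srcDev, haloBytes);
}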
Multiple Devices: Take 2
[Chart: runtime (seconds, log scale) vs. cube size for 1, 2, and 4 GPUs under the P2P scheme (P2P-1, P2P-2, P2P-4)]
Overall Throughput
[Chart: overall throughput (Mpoints/sec) vs. cube size for P2P-1, P2P-2, and P2P-4]
Multiple Devices: Take 2
• But we're still limited to one node…
[Diagram: the same single-node, four-GPU P2P layout as above]
Multiple Devices: Take 2.1
• Scale to multiple nodes with MPI
• Handles an arbitrary* number of devices & nodes
[Diagram: Node 0 through Node N, each with one MPI rank driving four GPUs via P2P; MPI connects the nodes]
Multiple Devices: Take 2.1
[Chart: runtime (seconds, log scale) vs. cube size for P2P-1, P2P-2, P2P-4, and P2P-4(2)]
Multiple Devices: Take 2.1
[Chart: runtime (seconds, log scale) vs. cube size, comparing P2P-4(2) against MPI-8]
Overall Throughput
[Chart: overall throughput (Mpoints/sec) vs. cube size for P2P-1, P2P-2, P2P-4, and P2P-4(2)]
Speedup
[Chart: speedup factor vs. cube size (100 to 868) for GPU vs. 16-CPU, P2P-4 vs. 16-CPU, and P2P-4(2) vs. 16-CPU]
Wrapping up
• Elastic wave modeling represents a complex computational problem
– But GPUs are a promising route for acceleration
– A flexible multi-GPU implementation offers considerable speedup on large grids
• Represents a seed for further research, modifications, and extensions
• Check out our code: www.reproducibility.org
– Exec: sfewefd2d_gpu, sfewefd3d_gpu_p2p, sfewefd3d_multiNode
– Source: user/rweiss
– Examples: book/uwa/geo2013ElasticModellingGPU/
– Improvements and extensions welcomed… bug reports welcome as well