TRANSCRIPT
S4599: An Adventure in Porting: Adding GPU-Acceleration to Open-Source 3D Elastic Wave Modeling
Robin M. Weiss and Jeffery Shragge
March 26, 2014
GPU Technology Conference
Seismic Data Modeling
• Elastic modeling
– Anisotropic and heterogeneous materials
• Seismic imaging with elastic models
– Increasing interest in this area
– Solved as an inverse problem
– Requires accurate, fast forward solvers
• Computational issues
– Large model sizes
– Generally requires substantial compute power
Open-Source 3D Elastic Modeling
• Open-source computational geophysics package
– Madagascar – www.reproducibility.org (RSFSRC/user/rweiss)
– Targets reproducibility, transparency, and clarity
• Madagascar's CPU-based (w/ OpenMP) elastic wave equation (EWE) solver implementation:
– Paul Sava, Colorado School of Mines
– Anisotropic elastic wave equation
– Stress-stiffness formulation (written out just below)
– 3D finite-difference solver, centered derivatives, regular grid
• 8th order spatial, 2nd order time
Weiss and Shragge, 2013, Solving 3D anisotropic elastic wave equations on parallel GPU devices, Geophysics
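For context (not shown on the slide), the stress-stiffness formulation referred to here is, in standard index notation with implied summation, approximately:

\rho\,\partial_t^2 u_i = \partial_j \sigma_{ij} + f_i, \qquad
\sigma_{ij} = c_{ijkl}\,\varepsilon_{kl}, \qquad
\varepsilon_{kl} = \tfrac{1}{2}\left(\partial_k u_l + \partial_l u_k\right)

where u is displacement, c_ijkl the stiffness tensor, rho the density, and f the source term; this matches the displacement -> strain -> stress -> acceleration kernel sequence shown in the flowchart later in the talk.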
Finite-Difference and GPUs
• Represents the bulk of the compute work
• Compact FD stencils map nicely to GPUs (a rough sketch of such a kernel follows below)
– Regular memory access pattern
– Some data reuse (use shared memory)
– But still memory bound
• GTC 2013 – S3176
Micikevicius, 2009, 3D Finite Difference Computation on GPUs using CUDA, NVIDIA
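The Madagascar source (user/rweiss) is the authoritative version; as a rough, hypothetical illustration of the pattern described above, an 8th-order centered derivative kernel that stages a tile plus halo columns in shared memory might look like the sketch below. The kernel name, block dimensions, and coefficient array are invented for the sketch, which assumes nx and ny are multiples of the block dimensions.

// Hypothetical sketch, not the Madagascar kernel: 8th-order centered first
// derivative along x on a regular grid, staging a tile plus halo columns in
// shared memory for data reuse. Assumes nx % BDIMX == 0 and ny % BDIMY == 0.
#define RADIUS 4
#define BDIMX  32
#define BDIMY  8

__constant__ float c[RADIUS + 1];   // FD coefficients (center weight c[0] is zero here)

__global__ void ddx8(const float *u, float *du, int nx, int ny, int nz, float invdx)
{
    __shared__ float tile[BDIMY][BDIMX + 2 * RADIUS];

    int ix = blockIdx.x * BDIMX + threadIdx.x;
    int iy = blockIdx.y * BDIMY + threadIdx.y;
    int tx = threadIdx.x + RADIUS;

    for (int iz = 0; iz < nz; iz++) {
        int idx = (iz * ny + iy) * nx + ix;

        // stage the interior point and, via the first RADIUS threads, the halo columns
        tile[threadIdx.y][tx] = u[idx];
        if (threadIdx.x < RADIUS) {
            tile[threadIdx.y][tx - RADIUS] = (ix >= RADIUS)    ? u[idx - RADIUS] : 0.f;
            tile[threadIdx.y][tx + BDIMX]  = (ix + BDIMX < nx) ? u[idx + BDIMX]  : 0.f;
        }
        __syncthreads();

        // centered 8th-order first derivative (antisymmetric weights, no center term)
        float d = 0.f;
        for (int r = 1; r <= RADIUS; r++)
            d += c[r] * (tile[threadIdx.y][tx + r] - tile[threadIdx.y][tx - r]);
        du[idx] = d * invdx;
        __syncthreads();
    }
}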
[Flowchart: solver workflow. The CPU reads the source wavelet, stiffness, and density fields, then initializes GPU data structures and copies the model data to the GPU. Each timestep the GPU runs, in order: (1) Displacement to Strain, (2) Strain to Stress, (3) Near-Surface BC (if a free surface is present), (4) Inject Stress Source (if a stress source is used), (5) Stress to Acceleration, (6) Inject Accel Source (if an acceleration source is used), (7) Advance Time, (8) Boundary Conditions, (9) Extract Receiver Data. When the maximum number of timesteps is completed, the data is written to disk and the run is done.]
Optimize: The usual suspects…
• Stash data in shared memory and/or registers
• Find good block dimensions
• Replace repeated expressions
• Kernel fusing (see the sketch below)
– Keep an eye on register usage (-Xptxas -v)
• CUDA Visual Profiler is your friend
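As a hedged, schematic illustration of the kernel-fusing idea (function and variable names are invented, and the stencils are reduced to 2nd order in 1D for brevity), the point is simply that intermediate strain values stay in registers instead of making a round trip through global memory between two separate kernels:

// Hypothetical, schematic example of kernel fusing: strain is computed and
// immediately consumed in registers rather than being written to a global
// scratch array by one kernel and re-read by the next.
// Fusing raises register pressure; check it with: nvcc -Xptxas -v
__global__ void strain_stress_fused(const float *ux, const float *uz,
                                    const float *stiff, float *stress,
                                    int n, float invdx, float invdz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= n - 1) return;          // skip the edges of this 1D toy grid

    // intermediate strain components live only in registers
    float exx = 0.5f * invdx * (ux[i + 1] - ux[i - 1]);
    float ezz = 0.5f * invdz * (uz[i + 1] - uz[i - 1]);

    stress[i] = stiff[i] * (exx + ezz);       // schematic constitutive relation
}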
GPU-CPU Displacement Comparison
[Figure: depth component of the displacement field overlaid on the velocity model, computed with the GPU and CPU codes]
Off to the races…
• All experiments run on N³ data sets for 2000 timesteps
• CPU results computed with Intel Sandy Bridge E5-2670 2.60 GHz
– 2x 8-core, IBM iDataPlex dx360 M4
– University of Chicago's Midway Compute Cluster
• GPU results computed with Nvidia K40
– 4x GPUs per node
– Nvidia PSG cluster – thanks, Nvidia!
Off to the races…
[Chart: runtime (seconds, log scale) vs. cube size for the GPU and 16-CPU implementations]
Off to the races…
Per-kernel throughput (Mpoints/sec):
Cube Size   Disp. to Strain   Strain to Stress   Stress to Accel.   Advance Disp.
100         1405              1657               1064               2976
148         1333              1702               1053               3077
196         1467              1714               1164               3051
244         1449              1662               1054               3051
292         1404              1703               993                2950
340         1282              1721               1021               3170
Overall Throughput
[Chart: overall GPU throughput (Mpoints/sec) vs. cube size]
Speedup
[Chart: speedup factor vs. cube size (100 to 388), GPU vs. 16-CPU]
Speedup
[Chart: speedup factor vs. cube size (100 to 388), GPU vs. 16-CPU and GPU vs. 1-CPU]
While this is all well and good…
[Chart: runtime (seconds, log scale) vs. cube size up to 1024 for the GPU and 16-CPU implementations]
…CPU implementation has us beat
[Chart repeated from the previous slide: runtime (seconds, log scale) vs. cube size up to 1024 for GPU and 16-CPU]
GPU Memory Wall
• Limited GPU memory becomes an issue
• In our 3D implementation, each grid point requires:
– 1 density value
– 9 stiffness coefficients
– 3 displacement components (x3 time steps)
– 6 strain/stress field components
– 3 acceleration components
• GlobalMemory ≥ 28 * gridPoints * sizeof(float)
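To make the bound concrete, here is a small host-side helper (a sketch using the 28-floats-per-point count above; the 512³ example and the 12 GB figure for a single K40 are our own additions):

#include <stdio.h>

/* Per-point footprint from the breakdown above:
   1 density + 9 stiffness + 3x3 displacement + 6 strain/stress + 3 acceleration
   = 28 single-precision values per grid point. */
static size_t grid_bytes(size_t nx, size_t ny, size_t nz)
{
    return 28 * nx * ny * nz * sizeof(float);
}

int main(void)
{
    /* A 512^3 cube needs roughly 28 * 512^3 * 4 B, about 15 GB, which is
       already more than the 12 GB of device memory on a single K40. */
    printf("512^3 grid: %.1f GB\n", grid_bytes(512, 512, 512) / 1e9);
    return 0;
}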
Multiple Devices
• Increase GPU memory (in aggregate) by distributing the grid across multiple devices
• Not quite embarrassingly parallel
– FD stencil needs halo cells from neighbor device(s)
– Halo dance happens each iteration
– Communication can become costly
• Slice on the slowest-varying axis (see the sketch below)
– Easy to compute stride
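As a rough, hypothetical sketch of what slicing on the slowest-varying axis buys you: each device owns a contiguous run of grid planes plus ghost planes on its interior faces, so the per-device bookkeeping reduces to a handful of integers (struct and function names are invented here):

/* Hypothetical decomposition bookkeeping: split the nz planes of an nx*ny*nz
   grid across ndev devices along the slowest-varying axis. Each "plane" is a
   contiguous block of nx*ny floats, so offsets and strides stay simple. */
#define RADIUS 4            /* half-width of the 8th-order stencil */

typedef struct {
    int z0, nzLocal;        /* first owned plane and number of owned planes  */
    int haloLo, haloHi;     /* RADIUS ghost planes on interior faces, else 0 */
} Slab;

static Slab make_slab(int dev, int ndev, int nz)
{
    Slab s;
    int base = nz / ndev, rem = nz % ndev;
    s.nzLocal = base + (dev < rem ? 1 : 0);
    s.z0      = dev * base + (dev < rem ? dev : rem);
    s.haloLo  = (dev > 0)        ? RADIUS : 0;
    s.haloHi  = (dev < ndev - 1) ? RADIUS : 0;
    return s;
}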
Multiple Devices
[Diagram: the model grid partitioned into four slabs, one per device, GPU 0 through GPU 3]
Multiple Devices: Take 1
• One host process per device
• Communicate through MPI
[Diagram: Node 0 running Rank 0 and Rank 1, each attached to one GPU over PCIe and exchanging data via MPI]
Multiple Devices: Take 1
• Generalizes nicely, scales to multiple nodes
• Handles an arbitrary* number of devices & nodes
[Diagram: the same two-rank, two-GPU layout replicated on Node 0 through Node N, with MPI connecting ranks within and across nodes]
Multiple Devices: Take 1
• But: the data dance is awkward (read: slow)
[Diagram: on Node 0, Rank 0 stages gpu0_buf into send_buf with cudaMemcpy, exchanges it with Rank 1 via MPI_Send / MPI_Recv, and Rank 1 copies recv_buf into gpu1_buf with another cudaMemcpy]
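A hedged sketch of that per-iteration dance from one rank's point of view (function and buffer names are invented; MPI_Sendrecv stands in here for the MPI_Send / MPI_Recv pair on the slide to avoid deadlock):

#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical Take 1 halo exchange for one field on one MPI rank.
   haloBytes = RADIUS * nx * ny * sizeof(float); sendOff/recvOff are the element
   offsets of the boundary planes and ghost planes within the device array. */
void exchange_halo_take1(float *d_field, float *send_buf, float *recv_buf,
                         size_t haloBytes, int neighbor,
                         size_t sendOff, size_t recvOff, MPI_Comm comm)
{
    MPI_Status st;

    /* 1) device -> host staging buffer */
    cudaMemcpy(send_buf, d_field + sendOff, haloBytes, cudaMemcpyDeviceToHost);

    /* 2) swap halos with the neighboring rank over MPI */
    MPI_Sendrecv(send_buf, (int)haloBytes, MPI_BYTE, neighbor, 0,
                 recv_buf, (int)haloBytes, MPI_BYTE, neighbor, 0, comm, &st);

    /* 3) host -> device, into this rank's ghost planes */
    cudaMemcpy(d_field + recvOff, recv_buf, haloBytes, cudaMemcpyHostToDevice);
}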
Multiple Devices: Take 1
[Chart: runtime (seconds, log scale) vs. cube size for 1, 2, 4, and 8 GPUs under the MPI scheme (MPI-1, MPI-2, MPI-4, MPI-8)]
Multiple Devices: Take 1
• Can we fix this?
– CUDA-aware MPI…?
– GPU Direct…?
– RDMA…?
– Black magic…?
• Open-source research code considerations
– Keep it simple!
Multiple Devices: Take 2
• One host process per node
• Communicate through peer-to-peer memcpy (see the sketch below)
[Diagram: a single host process on Node 0 driving GPU 0 through GPU 3, with halos exchanged directly between devices via P2P]
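A minimal, hedged sketch of the peer-to-peer approach (helper names are ours; it assumes the devices can reach each other over PCIe, which is what cudaDeviceCanAccessPeer reports):

#include <cuda_runtime.h>

/* Hypothetical Take 2 helpers: one host process drives every GPU on the node. */

/* Enable peer access once at startup between all device pairs that support it. */
void enable_p2p(int ndev)
{
    for (int a = 0; a < ndev; a++) {
        cudaSetDevice(a);
        for (int b = 0; b < ndev; b++) {
            int ok = 0;
            if (a != b && cudaDeviceCanAccessPeer(&ok, a, b) == cudaSuccess && ok)
                cudaDeviceEnablePeerAccess(b, 0);
        }
    }
}

/* Each iteration, push this device's boundary planes straight into the
   neighbor's ghost planes: no host staging buffers and no on-node MPI. */
void send_halo_p2p(float *dst, int dstDev, const float *src, int srcDev, size_t haloBytes)
{
    cudaMemcpyPeer(dst, dstDev, src, srcDev, haloBytes);
}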
Multiple Devices: Take 2
[Chart: runtime (seconds, log scale) vs. cube size for 1, 2, and 4 GPUs under the P2P scheme (P2P-1, P2P-2, P2P-4)]
Overall Throughput
[Chart: overall throughput (Mpoints/sec) vs. cube size for P2P-1, P2P-2, and P2P-4]
Multiple Devices: Take 2
• But we're still limited to one node…
[Diagram: the same single-node, four-GPU P2P layout as above]
Multiple Devices: Take 2.1
• Scale to multiple nodes with MPI
• Handles an arbitrary* number of devices & nodes
[Diagram: Node 0 through Node N, each with one MPI rank driving four GPUs via P2P; MPI connects the nodes]
Multiple Devices: Take 2.1
[Chart: runtime (seconds, log scale) vs. cube size for P2P-1, P2P-2, P2P-4, and P2P-4(2)]
Multiple Devices: Take 2.1
[Chart: runtime (seconds, log scale) vs. cube size, comparing P2P-4(2) against MPI-8]
Overall Throughput
[Chart: overall throughput (Mpoints/sec) vs. cube size for P2P-1, P2P-2, P2P-4, and P2P-4(2)]
Speedup
[Chart: speedup factor vs. cube size (100 to 868) for GPU vs. 16-CPU, P2P-4 vs. 16-CPU, and P2P-4(2) vs. 16-CPU]
Wrapping up
• Elastic wave modeling represents a complex computational problem
– But GPUs are a promising route for acceleration
– A flexible multi-GPU implementation offers considerable speedup on large grids
• Represents a seed for further research, modifications, and extensions
• Check out our code: www.reproducibility.org
– Exec: sfewefd2d_gpu, sfewefd3d_gpu_p2p, sfewefd3d_multiNode
– Source: user/rweiss
– Examples: book/uwa/geo2013ElasticModellingGPU/
– Improvements and extensions welcomed… bug reports welcome as well