TRANSCRIPT
Revisiting Kirchhoff Migration on GPUs
2015 Rice Oil & Gas HPC Workshop
Rajesh Gandham, Rice University & Hess Corporation (intern) Thomas Cullison, Hess Corporation
Scott Morton, Hess Corporation
Seismic Experiment
[Image: seismic survey illustration, source: http://www.chevron.pl/images/timeline/rsImgSeismicImaging1.jpg]
Kirchhoff Migration
[Diagram: a source at the surface emits a wave that travels time Ts down to an image point at (x, y, z); the reflection travels time Tr back up to a receiver. For each image point and each data trace, the trace sample at total time t = Ts + Tr is added to the image ("Add to Image"). Labeled elements: image trace, data trace, image point, source, receiver, Ts, Tr.]
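The summation on this slide can be sketched as a toy loop (a minimal sketch with hypothetical array shapes and NumPy sampling, not the production kernel): for each trace and each image point, form t = Ts + Tr from the pre-computed travel times and add the trace sample at that time into the image.

```python
import numpy as np

def kirchhoff_migrate(traces, ts, tr, dt):
    """Toy Kirchhoff summation (hypothetical shapes, not the production kernel).

    traces : (n_traces, n_samples) recorded data
    ts, tr : (n_traces, n_image) travel times source->image and image->receiver
    dt     : sample interval in seconds
    """
    n_traces, n_samples = traces.shape
    image = np.zeros(ts.shape[1])
    for i in range(n_traces):
        # total two-way time t = Ts + Tr for every image point
        t = ts[i] + tr[i]
        # nearest-neighbor sampling for simplicity; production codes interpolate
        idx = np.round(t / dt).astype(int)
        valid = (idx >= 0) & (idx < n_samples)
        # add each trace's sample at time t into the image ("Add to Image")
        image[valid] += traces[i, idx[valid]]
    return image
```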
Seismic Image
Project Goals
• Hardware portability
• General image gathers
• Improve migration performance
OCCA for Portability
Portability Results
• Ported and tested production kernel from CUDA to OCCA in ~3 weeks
• Tested and verified kernel results on CPU and GPU
• Tested production migration on GPUs
• Performance
• Greater kernel performance because of runtime compilation
• Kernels still need some tuning for best performance on various architectures
Project Goals
• Hardware portability
• General image gathers
• Improve migration performance
Standard Kirchhoff Imaging
• Pre-compute coarse travel times from surface locations to image points
• 4D surface integral through a 5D data set to a 3D image
• Computational complexity:
• NI ~ 10^10, number of output image points
• ND ~ 10^9, number of input data traces
• f ~ 10, number of cycles/image-point/trace
• f·NI·ND ~ 10^20 cycles ~ 10^3 CPU core years
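The slide's cost estimate checks out with quick back-of-the-envelope arithmetic (the ~3 GHz core clock is an assumption not stated on the slide):

```python
# Order-of-magnitude cost of standard Kirchhoff imaging, from the slide's figures.
N_I = 1e10  # number of output image points
N_D = 1e9   # number of input data traces
f = 10      # cycles per image-point per trace

cycles = f * N_I * N_D  # ~1e20 cycles total
core_hz = 3e9           # assumed ~3 GHz CPU core clock (not from the slide)
core_years = cycles / core_hz / (3600 * 24 * 365)
print(f"{cycles:.0e} cycles, ~{core_years:.0f} CPU core years")  # ~10^3 core years
```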
Kirchhoff Gather Imaging
• Pre-compute coarse travel times from surface locations to image points
• 4D surface integral through a 5D data set to a 4D/5D image
• Image gathers
• Offset
• Offset vector tile (OVT)
• Subsurface angles
• etc.
Project Goals
• Hardware portability
• General image gathers
• Improve migration performance
Previous Approach
• Define tasks that can be run in parallel
• Task should be small enough to fit on a GPU
• Copying data to and from the GPU is expensive
• Global memory access can be a bottleneck
Previous Approach
Previous Approach
↔
Previous Approach Overview
• ~32 traces per task
• Big image block per task
• One gather bin per task
• Pre-filter the data
• Resample the data
• CUDA programming model
New Approach for Performance
• Define tasks that can be run in parallel
• Task should be small enough to fit on a GPU
• Copying data to and from the GPU is expensive
• Global memory access can be a bottleneck
• Improve FLOPs/load
New Approach
• Parallelism in image volume
• Parallelism in data traces
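The two-level decomposition described here can be sketched as follows (a hypothetical sketch; the block sizes and the flat task list are assumptions, not the production scheduler): each task pairs one small image block with one large block of input traces, so image blocks provide coarse parallelism across devices while the traces within a task provide fine-grained parallelism inside a kernel.

```python
def make_tasks(n_image_points, n_traces, image_block=4096, trace_block=200_000):
    """Sketch of the new decomposition (hypothetical sizes): pair each
    small image block with each large block of input traces."""
    tasks = []
    for i0 in range(0, n_image_points, image_block):
        for t0 in range(0, n_traces, trace_block):
            # one GPU task: migrate this trace block into this image block
            tasks.append((slice(i0, min(i0 + image_block, n_image_points)),
                          slice(t0, min(t0 + trace_block, n_traces))))
    return tasks
```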
Parameter Analysis
Computation-to-Memory Efficiency
[Plot: migration-contributions/byte as a function of input trace block size (m) and output image block size (m).]
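The efficiency metric on this plot can be illustrated with a toy bookkeeping model (an assumption for illustration, stated in counts rather than the plot's meters, and ignoring cache and occupancy effects): every trace in a task contributes to every image point, while each trace and each image point moves through memory once per task.

```python
def contributions_per_byte(trace_block, image_block,
                           samples_per_trace=2000, bytes_per_sample=4,
                           image_bytes_per_point=4):
    """Toy efficiency model (assumed sizes, not from the slides):
    contributions grow as trace_block * image_block, while memory
    traffic grows only linearly in each block size."""
    contributions = trace_block * image_block
    bytes_moved = (trace_block * samples_per_trace * bytes_per_sample
                   + image_block * image_bytes_per_point)
    return contributions / bytes_moved
```

In this simple model the efficiency rises with both block sizes; the measured surface will also reflect on-chip memory limits that the model ignores.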
New Approach
• Implementation for general image gathers
• Offset gather, OVT gather, reflection angle gather, etc.
• Produce a small chunk of image quickly
• See imaging results as each task finishes
• Improve the overall performance on new hardware
• The production code was optimized for CUDA and NVIDIA GPUs in 2008/2009
• Develop portable software
• Hardware architectures change relatively fast
• Several vendors and varieties of accelerators
• Several parallel models for various languages
Production vs New Approach
Production approach:
• ~32 traces per task
• Big image block per task
• One gather bin per task
• Pre-filter the data
• Resample the data
• CUDA programming model

New approach:
• ~200k traces per task
• Small image block per task
• Multiple gather bins per task
• Filter on the fly
• Interpolate on the fly
• OCCA programming approach

• Avoiding pre-filtering (for anti-aliasing) and resampling:
– Reduces memory overhead
– Increases the number of computations per migration contribution
– Greater FLOPs/byte
Production vs New Approach
[Plot: million migration-contributions/s versus output image block length (m), comparing the production approach with the new approach.]
• Input traces are fixed at ~177,000 (NVIDIA K10)
• Pre-filtering and resampling of the production code is not included
New Approach Outcomes
• Improved performance over the production code (best estimate: ~2X)
• Generalized gather kernel framework
• Portable implementation
• Tested and verified CPU vs GPU results
• Tested and compared OpenCL vs CUDA
• Performance on AMD GPUs is similar to NVIDIA GPUs
New Approach Kernel: NVIDIA vs AMD
[Plot: million migration-contributions/s versus output image volume size (m) for OpenCL + K40, CUDA + K40, and OpenCL + Tahiti.]
• Number of input traces ~177,000
Project Goals Review
Hardware portability
General image gathers
Improve migration performance
Future Work
• Finish integration of the new kernel into production
• More testing on various accelerators
• Explore using mixed-architecture migrations
Acknowledgements
• Hess Corporation
• CAAM @ Rice University
• Tim Warburton
• David Medina