TRANSCRIPT
Revisiting Kirchhoff Migration on GPUs
2015 Rice Oil & Gas HPC Workshop
Rajesh Gandham, Rice University & Hess Corporation (intern) Thomas Cullison, Hess Corporation
Scott Morton, Hess Corporation
Seismic Experiment
[Image: seismic survey illustration, source: http://www.chevron.pl/images/timeline/rsImgSeismicImaging1.jpg]
Kirchhoff Migration
[Diagram: a source at the surface emits a wave that travels time Ts down to an image point at (x, y, z); the reflection travels time Tr back up to a receiver. For each image point and each data trace, the trace sample at total time t = Ts + Tr is added to the image ("Add to Image"). Labeled elements: image trace, data trace, image point, source, receiver, Ts, Tr.]
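The summation on this slide can be sketched as a toy loop (a minimal sketch with hypothetical array shapes and NumPy sampling, not the production kernel): for each trace and each image point, form t = Ts + Tr from the pre-computed travel times and add the trace sample at that time into the image.

```python
import numpy as np

def kirchhoff_migrate(traces, ts, tr, dt):
    """Toy Kirchhoff summation (hypothetical shapes, not the production kernel).

    traces : (n_traces, n_samples) recorded data
    ts, tr : (n_traces, n_image) travel times source->image and image->receiver
    dt     : sample interval in seconds
    """
    n_traces, n_samples = traces.shape
    image = np.zeros(ts.shape[1])
    for i in range(n_traces):
        # total two-way time t = Ts + Tr for every image point
        t = ts[i] + tr[i]
        # nearest-neighbor sampling for simplicity; production codes interpolate
        idx = np.round(t / dt).astype(int)
        valid = (idx >= 0) & (idx < n_samples)
        # add each trace's sample at time t into the image ("Add to Image")
        image[valid] += traces[i, idx[valid]]
    return image
```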
Seismic Image
Project Goals
• Hardware portability
• General image gathers
• Improve migration performance
OCCA for Portability
Portability Results
• Ported and tested production kernel from CUDA to OCCA in ~3 weeks
• Tested and verified kernel results on CPU and GPU
• Tested production migration on GPUs
• Performance
• Greater kernel performance because of runtime compilation
• Kernels still need some tuning for best performance on various architectures
Project Goals
• Hardware portability
• General image gathers
• Improve migration performance
Standard Kirchhoff Imaging
• Pre-compute coarse travel times from surface locations to image points
• 4D surface integral through a 5D data set to a 3D image
• Computational complexity:
• NI ~ 10^10, number of output image points
• ND ~ 10^9, number of input data traces
• f ~ 10, number of cycles/image-point/trace
• f·NI·ND ~ 10^20 cycles ~ 10^3 CPU core years
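The slide's cost estimate checks out with quick back-of-the-envelope arithmetic (the ~3 GHz core clock is an assumption not stated on the slide):

```python
# Order-of-magnitude cost of standard Kirchhoff imaging, from the slide's figures.
N_I = 1e10  # number of output image points
N_D = 1e9   # number of input data traces
f = 10      # cycles per image-point per trace

cycles = f * N_I * N_D  # ~1e20 cycles total
core_hz = 3e9           # assumed ~3 GHz CPU core clock (not from the slide)
core_years = cycles / core_hz / (3600 * 24 * 365)
print(f"{cycles:.0e} cycles, ~{core_years:.0f} CPU core years")  # ~10^3 core years
```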
Kirchhoff Gather Imaging
• Pre-compute coarse travel times from surface locations to image points
• 4D surface integral through a 5D data set to a 4D/5D image
• Image gathers
• Offset
• Offset vector tile (OVT)
• Subsurface angles
• etc.
Project Goals
• Hardware portability
• General image gathers
• Improve migration performance
Previous Approach
• Define tasks that can be run in parallel
• Task should be small enough to fit on a GPU
• Copying data to and from the GPU is expensive
• Global memory access can be a bottleneck
Previous Approach
Previous Approach
↔
Previous Approach Overview
• ~32 traces per task
• Big image block per task
• One gather bin per task
• Pre-filter the data
• Resample the data
• CUDA programming model
New Approach for Performance
• Define tasks that can be run in parallel
• Task should be small enough to fit on a GPU
• Copying data to and from the GPU is expensive
• Global memory access can be a bottleneck
• Improve FLOPs/load
New Approach
• Parallelism in image volume
• Parallelism in data traces
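The two-level decomposition described here can be sketched as follows (a hypothetical sketch; the block sizes and the flat task list are assumptions, not the production scheduler): each task pairs one small image block with one large block of input traces, so image blocks provide coarse parallelism across devices while the traces within a task provide fine-grained parallelism inside a kernel.

```python
def make_tasks(n_image_points, n_traces, image_block=4096, trace_block=200_000):
    """Sketch of the new decomposition (hypothetical sizes): pair each
    small image block with each large block of input traces."""
    tasks = []
    for i0 in range(0, n_image_points, image_block):
        for t0 in range(0, n_traces, trace_block):
            # one GPU task: migrate this trace block into this image block
            tasks.append((slice(i0, min(i0 + image_block, n_image_points)),
                          slice(t0, min(t0 + trace_block, n_traces))))
    return tasks
```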
Parameter Analysis
Computation-to-Memory Efficiency
[Plot: migration-contributions/byte as a function of input trace block size (m) and output image block size (m).]
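The efficiency metric on this plot can be illustrated with a toy bookkeeping model (an assumption for illustration, stated in counts rather than the plot's meters, and ignoring cache and occupancy effects): every trace in a task contributes to every image point, while each trace and each image point moves through memory once per task.

```python
def contributions_per_byte(trace_block, image_block,
                           samples_per_trace=2000, bytes_per_sample=4,
                           image_bytes_per_point=4):
    """Toy efficiency model (assumed sizes, not from the slides):
    contributions grow as trace_block * image_block, while memory
    traffic grows only linearly in each block size."""
    contributions = trace_block * image_block
    bytes_moved = (trace_block * samples_per_trace * bytes_per_sample
                   + image_block * image_bytes_per_point)
    return contributions / bytes_moved
```

In this simple model the efficiency rises with both block sizes; the measured surface will also reflect on-chip memory limits that the model ignores.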
New Approach
• Implementation for general image gathers
• Offset gather, OVT gather, reflection angle gather, etc.
• Produce a small chunk of image quickly
• See imaging results as each task finishes
• Improve the overall performance on new hardware
• The production code was optimized for CUDA and NVIDIA GPUs in 2008/2009
• Develop portable software
• Hardware architectures change relatively fast
• Several vendors and varieties of accelerators
• Several parallel models for various languages
Production vs New Approach
Production approach:
• ~32 traces per task
• Big image block per task
• One gather bin per task
• Pre-filter the data
• Resample the data
• CUDA programming model

New approach:
• ~200k traces per task
• Small image block per task
• Multiple gather bins per task
• Filter on the fly
• Interpolate on the fly
• OCCA programming approach

• Avoiding pre-filtering (for anti-aliasing) and resampling:
– Reduces memory overhead
– Increases the number of computations per migration contribution
– Greater FLOPs/byte
Production vs New Approach
[Plot: million migration-contributions/s versus output image block length (m), comparing the production approach with the new approach.]
• Input traces are fixed at ~177,000 (NVIDIA K10)
• Pre-filtering and resampling of the production code is not included
New Approach Outcomes
• Improved performance over the production code (best estimate: ~2X)
• Generalized gather kernel framework
• Portable implementation
• Tested and verified CPU vs GPU results
• Tested and compared OpenCL vs CUDA
• Performance on AMD GPUs is similar to NVIDIA GPUs
New Approach Kernel: NVIDIA vs AMD
[Plot: million migration-contributions/s versus output image volume size (m) for OpenCL + K40, CUDA + K40, and OpenCL + Tahiti.]
• Number of input traces ~177,000
Project Goals Review
Hardware portability
General image gathers
Improve migration performance
Future Work
• Finish integration of the new kernel into production
• More testing on various accelerators
• Explore using mixed-architecture migrations
Acknowledgements
• Hess Corporation
• CAAM @ Rice University
• Tim Warburton
• David Medina