
Page 1

Reverse Time Migration on GMAC

Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)

NVIDIA GTC, 22nd of September, 2010

Page 2

NVIDIA GPU Technology Conference – 22nd of September, 2010 2

Outline

•Introduction

•Reverse Time Migration on CUDA

•GMAC at a glance

•Reverse Time Migration on GMAC

•Conclusions

Page 3

NVIDIA GPU Technology Conference – 22nd of September, 2010 3

Reverse Time Migration on CUDA

•RTM generates an image of the subsurface layers

•Uses traces recorded by sensors in the field

•RTM’s algorithm:

1. Propagation of a modeled wave (forward in time)

2. Propagation of the recorded traces (backward in time)

3. Correlation of the forward and backward wavefields

• Last forward wavefield with the first backward wavefield

•FDTD is preferred to FFT (a minimal update sketch follows)

• 2nd-order finite differencing in time

• High-order finite differencing in space

└ RTM
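
As a concrete illustration of the two bullets above, here is a minimal sketch (not the authors' kernel; the grid sizes, stencil radius and coefficient array are placeholders) of one 2nd-order-in-time, high-order-in-space update step:

    #include <stddef.h>

    #define NX 512
    #define NY 512
    #define NZ 512
    #define RADIUS 4                    /* e.g. an 8th-order stencil in space */

    static size_t idx3(int x, int y, int z) { return ((size_t)z * NY + y) * NX + x; }

    /* next/cur/prev: wavefield at t+dt, t, t-dt; vel2dt2[i] = v[i]^2 * dt^2;
       coeff holds the RADIUS+1 central-difference coefficients */
    void fd_step(float *next, const float *cur, const float *prev,
                 const float *vel2dt2, const float *coeff)
    {
        for (int z = RADIUS; z < NZ - RADIUS; z++)
            for (int y = RADIUS; y < NY - RADIUS; y++)
                for (int x = RADIUS; x < NX - RADIUS; x++) {
                    /* high-order finite differences in space (Laplacian) */
                    float lap = 3.0f * coeff[0] * cur[idx3(x, y, z)];
                    for (int r = 1; r <= RADIUS; r++)
                        lap += coeff[r] * (cur[idx3(x + r, y, z)] + cur[idx3(x - r, y, z)]
                                         + cur[idx3(x, y + r, z)] + cur[idx3(x, y - r, z)]
                                         + cur[idx3(x, y, z + r)] + cur[idx3(x, y, z - r)]);
                    /* 2nd-order finite difference in time */
                    next[idx3(x, y, z)] = 2.0f * cur[idx3(x, y, z)]
                                        - prev[idx3(x, y, z)]
                                        + vel2dt2[idx3(x, y, z)] * lap;
                }
    }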

Page 4

NVIDIA GPU Technology Conference – 22nd of September, 2010 4

Introduction

•BSC and Repsol: Kaleidoscope project

• Develop better algorithms/techniques for seismic imaging

• We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration

•Due to the high computational power required, the project started a quest for the most suitable hardware

• PowerPC: scalability issues

• Cell: good performance (in production @ Repsol), difficult programmability

• FPGA: potentially best performance, programmability nightmare

• GPUs: 5x speedup vs Cell (GTX280), what about programmability?

└ Barcelona Supercomputing Center (BSC)

Page 5

NVIDIA GPU Technology Conference – 22nd of September, 2010 5

Outline

•Introduction

•Reverse Time Migration on CUDA

→General approach

• Disk I/O

• Domain decomposition

• Overlapping computation and communication

•GMAC at a glance

•Reverse Time Migration on GMAC

•Conclusions

Page 6

NVIDIA GPU Technology Conference – 22nd of September, 2010 6

Reverse Time Migration on CUDA

•We focus on the host-side part of the implementation

1. Avoid memory transfers between host and GPU memories

• Implement on the GPU as many computations as possible

2. Hide the latency of memory transfers

• Overlap memory transfers and kernel execution

3. Take advantage of the PCIe full-duplex capabilities (Fermi)

• Overlap deviceToHost and hostToDevice memory transfers (see the stream sketch below)

└ General approach
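
A minimal CUDA sketch of points 2 and 3 (illustrative only: rtm_kernel, compress_kernel, the launch configuration and the snapshot interval N are placeholders, not the original host code). Kernels run on one stream while the asynchronous deviceToHost copy of a compressed snapshot proceeds on another; the host buffer must be page-locked for the copy to be asynchronous.

    #include <cuda_runtime.h>

    __global__ void rtm_kernel(float *wf) { /* stands for the real RTM kernels */ }
    __global__ void compress_kernel(const float *wf, float *snap) { /* placeholder */ }

    void forward_loop(float *d_wave, float *d_snap, float *h_snap,  /* h_snap: page-locked */
                      size_t snap_bytes, int steps, int N)
    {
        cudaStream_t compute, copy;
        cudaEvent_t snap_ready;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);
        cudaEventCreate(&snap_ready);

        for (int i = 0; i < steps; i++) {
            rtm_kernel<<<256, 256, 0, compute>>>(d_wave);
            if (i % N == 0) {
                /* compress into a separate buffer so later kernels can overlap the copy */
                compress_kernel<<<256, 256, 0, compute>>>(d_wave, d_snap);
                cudaEventRecord(snap_ready, compute);
                cudaStreamWaitEvent(copy, snap_ready, 0);
                cudaMemcpyAsync(h_snap, d_snap, snap_bytes,
                                cudaMemcpyDeviceToHost, copy);  /* overlaps the next kernels */
            }
        }
        cudaStreamSynchronize(copy);
        cudaStreamDestroy(compute);
        cudaStreamDestroy(copy);
        cudaEventDestroy(snap_ready);
    }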

Page 7

NVIDIA GPU Technology Conference – 22nd of September, 2010 7

Reverse Time Migration on CUDA

└ General approach

[Figure: the two passes of the algorithm. Forward: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression, Write to disk. Backward: 3D-Stencil, Absorbing Boundary Conditions, Traces insertion, Decompression, Read from disk, Correlation.]

Page 8

NVIDIA GPU Technology Conference – 22nd of September, 2010 8

Reverse Time Migration on CUDA

•Data structures used in the RTM algorithm

• Read/Write structures

• 3D volume for the wavefield (can be larger than 1000x1000x1000 points)

• State of the wavefield in previous time-steps to compute finite differences in time (see the layout sketch below)

• Some extra points in each direction at the boundaries (halos)

• Read-Only structures

• 3D volume of the same size as the wavefield

• Geophones’ recorded traces: time-steps x #geophones

└ General approach
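
To make the read/write structures above concrete, a small sketch of a possible layout (field names are illustrative, not the original data structures): two time levels of the wavefield, padded with halo points on every side.

    #include <stddef.h>

    typedef struct {
        int nx, ny, nz;   /* interior points per axis                           */
        int halo;         /* extra points per side (the spatial stencil radius) */
        float *cur;       /* wavefield at time-step n                           */
        float *prev;      /* wavefield at time-step n-1 (swapped every step)    */
    } Wavefield;

    /* bytes needed for one padded time level */
    size_t wavefield_bytes(const Wavefield *w)
    {
        size_t px = (size_t)w->nx + 2 * w->halo;
        size_t py = (size_t)w->ny + 2 * w->halo;
        size_t pz = (size_t)w->nz + 2 * w->halo;
        return px * py * pz * sizeof(float);
    }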

Page 9

NVIDIA GPU Technology Conference – 22nd of September, 2010 9

Reverse Time Migration on CUDA

•Data flow-graph (forward)

└ General approach

[Figure: data flow-graph — the 3D-Stencil, ABC, Source and Compress kernels operate on the wavefields and on constant read-only data (velocity model, geophones' traces).]

Page 10

NVIDIA GPU Technology Conference – 22nd of September, 2010 10

Reverse Time Migration on CUDA

•Simplified data flow-graph (forward)

└ General approach

[Figure: simplified data flow-graph — a single RTM Kernel plus Compress operating on the wavefields and on constant read-only data (velocity model, geophones' traces).]

Page 11

NVIDIA GPU Technology Conference – 22nd of September, 2010 11

Reverse Time Migration on CUDA

•Control flow-graph (forward)

• RTM Kernel Computation

• Compress and transfer to disk

• deviceToHost + Disk I/O

• Performed every N steps

• Can run in parallel with the next compute steps

└ General approach

[Figure: control flow-graph — starting at i = 0, the RTM Kernel runs on the GPU each step; when i%N == 0, Compress runs on the GPU and toHost + Disk I/O run on the CPU; i++ until i reaches steps.]

Page 12

NVIDIA GPU Technology Conference – 22nd of September, 2010 12

Outline

•Introduction

•Reverse Time Migration on CUDA

• General approach

→Disk I/O

• Domain decomposition

• Overlapping computation and communication

•GMAC at a glance

•Reverse Time Migration on GMAC

•Conclusions

Page 13

NVIDIA GPU Technology Conference – 22nd of September, 2010 13

Reverse Time Migration on CUDA

•GPU → Disk transfers are very time-consuming

•Transferring to disk can be overlapped with the next (compute-only) steps

└ Disk I/O

[Figure: two timelines — without overlap, compression (C), toHost and Disk I/O delay kernel K5; with overlap, Disk I/O runs on the CPU while kernels K5–K8 keep running on the GPU.]

Page 14

NVIDIA GPU Technology Conference – 22nd of September, 2010 14

Reverse Time Migration on CUDA

•Single transfer: wait for all the data to be in host memory

•Multiple transfers: overlap deviceToHost transfers with disk I/O

• Double buffering (sketched below)

└ Disk I/O

[Figure: timelines — a single deviceToHost transfer followed by one long Disk I/O, versus multiple small toHost transfers interleaved with Disk I/O chunks.]
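
A sketch of the multiple-transfer scheme (illustrative: chunk size, names and error handling are omitted): the snapshot comes back in chunks through two page-locked buffers, so writing chunk k to disk overlaps with the deviceToHost copy of chunk k+1.

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void snapshot_to_disk(FILE *f, const float *d_snap, size_t bytes,
                          size_t chunk, cudaStream_t s)
    {
        float *buf[2];
        cudaHostAlloc((void **)&buf[0], chunk, cudaHostAllocDefault);  /* page-locked */
        cudaHostAlloc((void **)&buf[1], chunk, cudaHostAllocDefault);

        int cur = 0;
        size_t done = 0;
        cudaMemcpyAsync(buf[cur], d_snap, MIN(chunk, bytes),
                        cudaMemcpyDeviceToHost, s);
        while (done < bytes) {
            size_t len = MIN(chunk, bytes - done);
            cudaStreamSynchronize(s);               /* chunk 'cur' is now in host memory */
            if (done + len < bytes)                 /* start the next copy first ...     */
                cudaMemcpyAsync(buf[1 - cur], (const char *)d_snap + done + len,
                                MIN(chunk, bytes - done - len),
                                cudaMemcpyDeviceToHost, s);
            fwrite(buf[cur], 1, len, f);            /* ... so the disk write overlaps it */
            done += len;
            cur = 1 - cur;
        }
        cudaFreeHost(buf[0]);
        cudaFreeHost(buf[1]);
    }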

Page 15

NVIDIA GPU Technology Conference – 22nd of September, 2010 15

Reverse Time Migration on CUDA

•CUDA-RT limitations

• GPU memory accessible by the owner host thread only

→deviceToHost transfers must be performed by the compute thread

└ Disk I/O

[Figure: separate CPU and GPU address spaces; the GPU memory is owned by the compute thread, so the I/O thread cannot access it.]

Page 16

NVIDIA GPU Technology Conference – 22nd of September, 2010 16

Reverse Time Migration on CUDA

•CUDA-RT Implementation (single transfer)

• CUDA streams must be used to avoid blocking GPU execution

→An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!

└ Disk I/O

[Figure: the snapshot staged from the GPU address space through one large intermediate page-locked buffer in the CPU address space.]

Page 17

NVIDIA GPU Technology Conference – 22nd of September, 2010 17


Reverse Time Migration on CUDA

•CUDA-RT Implementation (multiple transfers)

• Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU

→Lots of synchronization code in the compute thread

└ Disk I/O

[Figure: CPU and GPU address spaces with several smaller transfers in flight.]

Page 18

NVIDIA GPU Technology Conference – 22nd of September, 2010 18

Outline

•Introduction

•Reverse Time Migration on CUDA

• General approach

• Disk I/O

→Domain decomposition

• Overlapping computation and communication

•GMAC at a glance

•Reverse Time Migration on GMAC

•Conclusions

Page 19

NVIDIA GPU Technology Conference – 22nd of September, 2010 19

Reverse Time Migration on CUDA

•But… wait, real-size problems require > 16GB of data!

•Volumes are split into tiles (along the Z-axis)

• 3D-Stencil introduces data dependencies

└ Domain decomposition

[Figure: the 3D volume (x, y, z axes) split along the Z-axis into domains D1–D4.]

Page 20

NVIDIA GPU Technology Conference – 22nd of September, 2010 20

Reverse Time Migration on CUDA

•Multi-node may be required to overcome memory capacity limitations

• Shared memory for intra-node communication

• MPI for inter-node communication (a halo-exchange sketch follows the figure)

└ Domain decomposition

[Figure: two nodes with four GPUs each (GPU1–GPU4); domains communicate through host memory within a node and through MPI between nodes.]
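
For the MPI part, a minimal sketch of a boundary exchange between neighbouring nodes (not the project's code; buffer and rank names are illustrative), assuming the boundary slabs have already been copied to host memory:

    #include <mpi.h>

    /* send_up/send_down: this node's boundary slabs (already copied from the GPUs);
       recv_up/recv_down: halo buffers to be copied back to the GPUs afterwards.
       'up' and 'down' are neighbour ranks, or MPI_PROC_NULL at the ends. */
    void exchange_node_halos(const float *send_up, float *recv_up,
                             const float *send_down, float *recv_down,
                             int halo_elems, int up, int down, MPI_Comm comm)
    {
        /* send our top boundary up, receive the lower neighbour's boundary into the bottom halo */
        MPI_Sendrecv(send_up,   halo_elems, MPI_FLOAT, up,   0,
                     recv_down, halo_elems, MPI_FLOAT, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send our bottom boundary down, receive the upper neighbour's boundary into the top halo */
        MPI_Sendrecv(send_down, halo_elems, MPI_FLOAT, down, 1,
                     recv_up,   halo_elems, MPI_FLOAT, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }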

Page 21

NVIDIA GPU Technology Conference – 22nd of September, 2010 21

Reverse Time Migration on CUDA

•Data flow-graph (multi-domain)

└ Domain decomposition

[Figure: data flow-graph for two domains — an RTM Kernel and Compress per domain, each working on its own wavefields and sharing the constant read-only data (velocity model, geophones' traces).]

Page 22

NVIDIA GPU Technology Conference – 22nd of September, 2010 22

Reverse Time Migration on CUDA

•Control flow-graph (multi-domain)

• Boundary exchange every time-step

• Inter-domain communication blocks execution of the next steps!

└ Domain decomposition

[Figure: control flow-graph — each time-step the Kernel runs on the GPU and boundaries are exchanged and synchronized; every N steps Compress runs on the GPU and toHost + Disk I/O run on the CPU.]

Page 23

NVIDIA GPU Technology Conference – 22nd of September, 2010 23

Reverse Time Migration on CUDA

•Boundary exchange every time-step is needed

└ Domain decomposition

[Figure: timeline — kernels K1–K7 separated by boundary exchanges (X) that block the GPU; compression, toHost and Disk I/O run on the CPU.]

Page 24

NVIDIA GPU Technology Conference – 22nd of September, 2010 24

Reverse Time Migration on CUDA

•Single-transfer exchange

• “Easy” to program, needs large page-locked buffers

•Multiple-transfer exchange to maximize PCI-Express utilization

• “Complex” to program, needs smaller page-locked buffers (sketched below)

└ Domain decomposition

[Figure: timelines — the single-transfer exchange does one long deviceToHost copy followed by one long hostToDevice copy; the multiple-transfer exchange pipelines small toHost and toDevice copies so both PCIe directions stay busy.]
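
A sketch of the multiple-transfer exchange (illustrative names, no error handling): the boundary moves in chunks, with deviceToHost copies on one stream and hostToDevice copies on another so both PCIe directions are used. The host buffers must be page-locked, and h_recv is assumed to have been filled already by the neighbouring domain's thread.

    #include <cuda_runtime.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void exchange_boundary(const float *d_send, float *h_send,   /* our boundary, outgoing    */
                           const float *h_recv, float *d_recv,   /* neighbour's boundary, in  */
                           size_t bytes, size_t chunk,
                           cudaStream_t to_host, cudaStream_t to_device)
    {
        for (size_t off = 0; off < bytes; off += chunk) {
            size_t len = MIN(chunk, bytes - off);
            cudaMemcpyAsync((char *)h_send + off, (const char *)d_send + off, len,
                            cudaMemcpyDeviceToHost, to_host);
            cudaMemcpyAsync((char *)d_recv + off, (const char *)h_recv + off, len,
                            cudaMemcpyHostToDevice, to_device);
        }
        cudaStreamSynchronize(to_host);
        cudaStreamSynchronize(to_device);
    }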

Page 25

NVIDIA GPU Technology Conference – 22nd of September, 2010 25


Reverse Time Migration on CUDA

•CUDA-RT limitations

• Each host thread can only access the memory objects it allocates

└ Domain decomposition

[Figure: one CPU address space and four separate GPU address spaces (GPU1–GPU4), each visible only to the thread that owns it.]

Page 26

NVIDIA GPU Technology Conference – 22nd of September, 2010 26

Reverse Time Migration on CUDA

•CUDA-RT implementation (single-transfer exchange)

• Streams and page-locked memory buffers must be used

• Page-locked memory buffers can be too big

└ Domain decomposition

[Figure: single-transfer exchange between the four GPUs' address spaces, staged through large page-locked buffers in the CPU address space.]

Page 27

NVIDIA GPU Technology Conference – 22nd of September, 2010 27

•CUDA-RT implementation (multiple-transfer exchange)

• Uses small page-locked buffers

• More synchronization code

•Too complex to be represented using PowerPoint!

•Very difficult to implement in real code!

└ Domain decomposition

Page 28

NVIDIA GPU Technology Conference – 22nd of September, 2010 28

Outline

•Introduction

•Reverse Time Migration on CUDA

• General approach

• Disk I/O

• Domain decomposition

→Overlapping computation and communication

•GMAC at a glance

•Reverse Time Migration on GMAC

•Conclusions

Page 29

NVIDIA GPU Technology Conference – 22nd of September, 2010 29

Reverse Time Migration on CUDA

•Problem: boundary exchange blocks the execution of the following time-step

└ Overlapping computation and communication

[Figure: timeline — kernels K1–K7 separated by blocking boundary exchanges (X), as on page 23.]

Page 30

NVIDIA GPU Technology Conference – 22nd of September, 2010 30

Reverse Time Migration on CUDA

•Solution: a two-stage execution plan lets us overlap the boundary exchange between domains with computation

└ Overlapping computation and communication

[Figure: timeline — each time-step is split into a small stage-1 kernel (k), a boundary exchange (X) and a large stage-2 kernel (K); the exchange overlaps with the stage-2 computation, while compression, toHost and Disk I/O proceed on the CPU.]

Page 31

NVIDIA GPU Technology Conference – 22nd of September, 2010 31

Reverse Time Migration on CUDA

└ Overlapping computation and communication

[Figure: the two neighbouring domains, one per GPU (GPU1, GPU2), split along z.]

•Approach: two-stage execution

• Stage 1: compute the wavefield points to be exchanged

Page 32

NVIDIA GPU Technology Conference – 22nd of September, 2010 32

Reverse Time Migration on CUDA

└ Overlapping computation and communication

[Figure: the two neighbouring domains, one per GPU (GPU1, GPU2), split along z.]

•Approach: two-stage execution

• Stage 2: compute the remaining points while exchanging the boundaries (sketched below)
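
Put together, one time-step of the two-stage scheme could look roughly like this (a sketch: kernel names, grid sizes and the exchange helper are placeholders, not the original code):

    #include <cuda_runtime.h>

    __global__ void stencil_boundary(float *wf) { /* stage 1: the slabs to be exchanged */ }
    __global__ void stencil_interior(float *wf) { /* stage 2: the remaining points      */ }

    /* placeholder for the chunked, double-buffered halo exchange of the previous slides */
    static void exchange_boundaries_async(float *wf, cudaStream_t s) { (void)wf; (void)s; }

    void two_stage_step(float *d_wf, dim3 grid_bound, dim3 grid_inner, dim3 block,
                        cudaStream_t s_bound, cudaStream_t s_inner)
    {
        /* Stage 1: compute only the wavefield points the neighbours need */
        stencil_boundary<<<grid_bound, block, 0, s_bound>>>(d_wf);
        cudaStreamSynchronize(s_bound);

        /* Stage 2: the interior update and the boundary exchange run concurrently */
        stencil_interior<<<grid_inner, block, 0, s_inner>>>(d_wf);
        exchange_boundaries_async(d_wf, s_bound);

        cudaStreamSynchronize(s_inner);
        cudaStreamSynchronize(s_bound);
    }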

Page 33

NVIDIA GPU Technology Conference – 22nd of September, 2010 33

Reverse Time Migration on CUDA

•But two-stage execution adds more abstractions and code complexity

• An additional stream per domain

• We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries

→At this point the code is a complete mess!

• Requires 4 streams per domain, many page-locked buffers, lots of inter-thread synchronization

• Poor readability and maintainability

• Easy to introduce bugs

└ Overlapping computation and communication

Page 34

NVIDIA GPU Technology Conference – 22nd of September, 2010 34

Outline

•Introduction

•Reverse Time Migration on CUDA

•GMAC at a glance

→Features

• Code examples

•Reverse Time Migration on GMAC

•Conclusions

Page 35

NVIDIA GPU Technology Conference – 22nd of September, 2010 35

GMAC at a glance

•Library that enhances the host programming model of CUDA

•Freely available at http://code.google.com/p/adsm/

• Developed by BSC and UIUC

• NCSA license (BSD-like)

• Works in Linux and MacOS X (Windows version coming soon)

•Presented in detail tomorrow at 9 am @ San Jose Ballroom

└ Introduction

Page 36

NVIDIA GPU Technology Conference – 22nd of September, 2010 36

GMAC at a glance

•Unified virtual address space for all the memories in the system

• Single allocation for shared objects

• Special API calls: gmacMalloc, gmacFree

• GPU memory allocated by a host thread is visible to all host threads

→Brings POSIX thread semantics back to developers (see the sketch below)

└ Features

[Figure: a single virtual address space spanning CPU memory and GPU memory — shared data is visible to both, CPU-only data stays on the host.]
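
A sketch of what those semantics allow (illustrative: the kernel, Dg/Db, W_SIZE and the thread split are placeholders in the style of the listings later in the deck): the pointer obtained by the compute thread can be used directly by an I/O thread.

    #include <pthread.h>
    #include <stdio.h>

    float *wavefield;                               /* GMAC allocation, visible to all host threads */

    void *io_thread(void *arg)
    {
        fwrite(wavefield, W_SIZE, 1, (FILE *)arg);  /* uses the GPU-backed pointer directly */
        return NULL;
    }

    void compute_and_dump(FILE *snapshot_file)
    {
        pthread_t io;
        gmacMalloc((void **)&wavefield, W_SIZE);    /* one allocation, one pointer */
        kernel<<<Dg, Db>>>(wavefield, W_SIZE);
        gmacThreadSynchronize();
        pthread_create(&io, NULL, io_thread, snapshot_file);
        pthread_join(io, NULL);
        gmacFree(wavefield);
    }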

Page 37

NVIDIA GPU Technology Conference – 22nd of September, 2010 37

GMAC at a glance

•Parallelism exposed via regular POSIX threads

• Replaces the explicit use of CUDA streams

• OpenMP support

•GMAC uses streams and page-locked buffers internally

• Concurrent kernel execution and memory transfers for free

└ Features


Page 38

NVIDIA GPU Technology Conference – 22nd of September, 2010 38

GMAC at a glance

•Optimized bulk memory operations via library interposition

• File I/O

• Standard I/O functions: fwrite, fread

• Automatic overlap of disk I/O with hostToDevice and deviceToHost transfers

• Optimized GPU to GPU transfers via regular memcpy (see the sketch below)

• Enhanced versions of the MPI send/receive calls

└ Features
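
In the same fragment style as the listings on pages 40 and 49 (W_SIZE, H_SIZE, the files and the kernel are placeholders), a sketch of what the interposed calls above make possible:

    float *wavefield, *neighbour;
    gmacMalloc((void **)&wavefield, W_SIZE);
    gmacMalloc((void **)&neighbour, W_SIZE);

    fread(wavefield, W_SIZE, 1, model_file);      /* disk read overlapped with the hostToDevice copy  */
    kernel<<<Dg, Db>>>(wavefield, W_SIZE);
    gmacThreadSynchronize();
    memcpy(neighbour, wavefield, H_SIZE);         /* GPU-to-GPU copy, double-buffered internally      */
    fwrite(wavefield, W_SIZE, 1, snapshot_file);  /* deviceToHost overlapped with the disk write      */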

Page 39

NVIDIA GPU Technology Conference – 22nd of September, 2010 39

Outline

•Introduction

•Reverse Time Migration on CUDA

•GMAC at a glance

• Features

→Code examples

•Reverse Time Migration on GMAC

•Conclusions

Page 40

NVIDIA GPU Technology Conference – 22nd of September, 2010 40

GMAC at a glance

•Single allocation (and pointer) for shared objects

└ Examples

CUDA-RT:

    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<Dg, Db>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:

    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<Dg, Db>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }

Page 41

NVIDIA GPU Technology Conference – 22nd of September, 2010 41

GMAC at a glance

•Optimized support for bulk memory operations

└ Examples

(Same CUDA-RT and GMAC listings as on page 40; the relevant part here is the fread call, which works directly on the GMAC allocation.)

Page 42

NVIDIA GPU Technology Conference – 22nd of September, 2010 42

Outline

•Introduction

•GMAC at a glance

•Reverse Time Migration on GMAC

→Disk I/O

• Domain decomposition

• Overlapping computation and communication

• Development cycle and debugging

•Conclusions

Page 43

NVIDIA GPU Technology Conference – 22nd of September, 2010 43


Reverse Time Migration on GMAC

•CUDA-RT Implementation (multiple transfers)

• Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU

→Lots of synchronization code in the compute thread

└ Disk I/O

[Figure: CPU and GPU address spaces, as on page 17.]

Page 44

NVIDIA GPU Technology Conference – 22nd of September, 2010 44

Reverse Time Migration on GMAC

•GMAC implementation

• deviceToHost transfers performed by the I/O thread

• deviceToHost and Disk I/O transfers overlap for free

• Small page-locked buffers are used

└ Disk I/O (GMAC)

[Figure: a single global address space shared by the CPU and the GPU.]

Page 45

NVIDIA GPU Technology Conference – 22nd of September, 2010 45

Outline

•Introduction

•GMAC at a glance

•Reverse Time Migration on GMAC

• Disk I/O

→Domain decomposition

• Overlapping computation and communication

• Development cycle and debugging

•Conclusions

Page 46

NVIDIA GPU Technology Conference – 22nd of September, 2010 46

Reverse Time Migration on GMAC

•CUDA-RT implementation (single-transfer exchange)

• Streams and page-locked memory buffers must be used

• Page-locked memory buffers can be too big

└ Domain decomposition (CUDA-RT)

[Figure: one CPU address space and four GPU address spaces (GPU1–GPU4), as on page 26.]

Page 47

NVIDIA GPU Technology Conference – 22nd of September, 2010 47

•GMAC implementation (multiple-transfer exchange)

• Exchange of boundaries performed using a simple memcpy! (sketched below)

• Full PCIe utilization: internally GMAC performs several transfers and double buffering

Reverse Time Migration on GMAC

└ Domain decomposition (GMAC)

[Figure: the four GPUs (GPU1–GPU4) and the host share a unified global address space.]
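
A sketch of the exchange itself in the same style (NUM_DOMAINS, DOMAIN_BYTES, the offsets and sizes are illustrative, as is the assumption that each domain's bottom halo sits at the start of its allocation): since every GPU's wavefield lives in the one address space, a single memcpy moves the halo, whichever thread calls it.

    float *wf[NUM_DOMAINS];                        /* wf[i] resides in GPU i's memory */
    for (int i = 0; i < NUM_DOMAINS; i++)
        gmacMalloc((void **)&wf[i], DOMAIN_BYTES);

    for (int i = 0; i + 1 < NUM_DOMAINS; i++)
        memcpy(wf[i + 1],                          /* neighbour's bottom halo                  */
               wf[i] + TOP_SLAB_OFFSET,            /* this domain's top boundary slab          */
               HALO_ELEMS * sizeof(float));        /* GMAC pipelines the PCIe transfers inside */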

Page 48

NVIDIA GPU Technology Conference – 22nd of September, 2010 48

Outline

•Introduction

•GMAC at a glance

•Reverse Time Migration on GMAC

• Disk I/O

• Domain decomposition

→Overlapping computation and communication

• Development cycle and debugging

•Conclusions

Page 49

NVIDIA GPU Technology Conference – 22nd of September, 2010 49

Reverse Time Migration on GMAC

•No streams, no page-locked buffers, similar performance: ±2%

└ Overlapping computation and communication

CUDA-RT:

    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:

    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);

    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

Page 50

NVIDIA GPU Technology Conference – 22nd of September, 2010 50

Outline

•Introduction

•GMAC at a glance

•Reverse Time Migration on GMAC

• Disk I/O

• Domain decomposition

• Inter-domain communication

→Development cycle and debugging

•Conclusions

Page 51

NVIDIA GPU Technology Conference – 22nd of September, 2010 51

Reverse Time Migration on GMAC

└ Development cycle and debugging

[Figure: the RTM kernels — 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression.]

•CUDA-RT

• Start from a simple, correct sequential code

• Implement kernels one at a time and check correctness

• Two allocations per data structure

• Keep data consistency by hand (cudaMemcpy)

• To introduce modifications to any kernel

• Two allocations per data structure

• Keep data consistency by hand (cudaMemcpy)

Page 52

NVIDIA GPU Technology Conference – 22nd of September, 2010 52

Reverse Time Migration on GMAC

•GMAC

• Allocate objects with gmacMalloc

• Single pointer

• Use the pointer both in the host code and in the GPU kernel implementations

• No copies

└ Development cycle and debugging

[Figure: the same RTM kernels — 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression.]

Page 53

NVIDIA GPU Technology Conference – 22nd of September, 2010 53

Outline

•Introduction

•Reverse Time Migration on CUDA

•GMAC at a glance

•Reverse Time Migration on GMAC

•Conclusions

Page 54

NVIDIA GPU Technology Conference – 22nd of September, 2010 54

Conclusions

•Heterogeneous systems based on GPUs are currently the most suitable platform for implementing RTM

•CUDA has programmability issues

• CUDA provides a good language to expose data parallelism in the code to be run on the GPU

• The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations

•GMAC eases the development of applications for GPU-based systems with no performance penalty

•Six months of a single part-time programmer produced the full RTM version (5x speedup over the previous Cell implementation)

Page 55

NVIDIA GPU Technology Conference – 22nd of September, 2010 55

Acknowledgements

•Barcelona Supercomputing Center

•Repsol

•Universitat Politècnica de Catalunya

•University of Illinois at Urbana-Champaign

Page 56

NVIDIA GPU Technology Conference – 22nd of September, 2010 56

Thank you!

Questions?