
Page 1

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method

Josh Romero, Massimiliano Fatica - NVIDIA
Vamsi Spandan, Roberto Verzicco - Physics of Fluids, University of Twente

HPC Advisory Council Workshop, Stanford, CA, February 2018

Page 2

Outline

● Introduction and Motivation

● Solver Details

● GPU implementation in CUDA Fortran

● Benchmarking and Results

● Conclusions

Page 3

Introduction and Motivation

● Increased availability of GPU compute resources:
○ Explosion of interest in Machine Learning
○ Focus on energy efficiency for exascale

● Lots of choices to make:
○ OpenACC vs. CUDA
○ CUDA C vs. CUDA Fortran

● Getting existing Fortran codes up and running on GPUs can be easy if you use the right tools

● Talk is focused on getting up and running with “low-effort.”

Page 4

Solver Details

Page 5

Solver Details

● Incompressible CFD solver for DNS computations in structured domains

● IB + structural solver using method described in [1]

○ Immersed interface contributes forcing term to fluid

○ Interface structural dynamics treated as triangulated network of springs

[1] Spandan et al., Journal of Computational Physics, 2017

Page 6

Solver Details

Solver flow diagram: Initialize Solver, then a Timestep Loop containing an RK Loop; each RK iteration performs Compute RK step → Compute IB forcing term → Structural update.

Page 7

GPU Implementation in CUDA Fortran

Page 8

CUDA Fortran

● Baseline CPU code is written in Fortran, so the natural choice for the GPU port is CUDA Fortran.

● Benefits:
○ More control than OpenACC:
■ Explicit GPU kernels written natively in Fortran are supported (see the sketch below)
■ Full control of host/device data movement
○ Directive-based programming available via CUF kernels
○ Easier to maintain than mixed CUDA C and Fortran approaches

● Requires PGI compiler (community edition available now for free)
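To illustrate the first benefit above, here is a minimal sketch of an explicit GPU kernel written natively in CUDA Fortran; the module name, kernel, and launch parameters are hypothetical and not from the talk:

module kernels_m
  use cudafor
contains
  ! Hypothetical example: scale a device array by a constant
  attributes(global) subroutine scale(a, alpha, n)
    real(8) :: a(*)
    real(8), value :: alpha
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = alpha * a(i)
  end subroutine scale
end module kernels_m

! Host side: launch with chevron syntax on a device array a_d
!   call scale<<<(n+255)/256, 256>>>(a_d, 2.0d0, n)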

Page 9

Profiling with NVPROF + NVVP + NVTX

● NVPROF:
○ Can be used to gather detailed kernel properties and timing information

● NVIDIA Visual Profiler (NVVP):
○ Graphical interface to visualize and analyze NVPROF-generated profiles
○ Does not show CPU activity out of the box

● NVIDIA Tools EXtension (NVTX) markers:
○ Enables annotation with labeled ranges within the program
○ Useful for categorizing parts of the profile to put activity into context
○ Can be used to visualize normally hidden CPU activity (e.g. MPI communication); a usage sketch follows below
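A rough sketch of how NVTX ranges can be called from Fortran, assuming a small wrapper module around the C API similar to NVIDIA's published nvtx.f90 example (link with -lnvToolsExt); the names below follow that convention but are not taken from the talk:

module nvtx
  use iso_c_binding
  interface
    ! Push a labeled range onto the NVTX stack
    subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
      use iso_c_binding
      character(kind=c_char, len=*) :: name   ! null-terminated label
    end subroutine
    ! Pop the most recent range
    subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
    end subroutine
  end interface
end module nvtx

! Usage: wrap a region so it appears as a labeled range in NVVP
!   use nvtx
!   call nvtxRangePushA('RK substep'//c_null_char)
!   ... work to annotate (e.g. MPI communication) ...
!   call nvtxRangePop()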

Page 10

NVIDIA Visual Profiler with NVTX Markers

Page 11

GPU Porting of Key Computational Routines

● In many CFD (and similar) codes, common code patterns appear:

○ Tightly-nested loop computations (computation of derivatives using stencils)

○ Common mathematical computations (Fourier transforms, matrix-algebra)

● But there are also unique patterns specific to a given application:

○ Computation of IB forcing on flow field

○ Computation of interface structural forces

Page 12

Case 1: Tightly-nested loops

Consider the original CPU subroutine to compute the divergence.

subroutine divg
  use param
  use local_arrays, only: q1, q2, q3, dph, jpv, ipv, udx3m
  ...
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg

Page 13

Case 1: Tightly-nested loops

Now, consider the version for GPU using CUF kernel directives.

subroutine divg
  use param
  use local_arrays, only: q1=>q1_d, q2=>q2_d, q3=>q3_d, &
                          dph=>dph_d, jpv=>jpv_d, ipv=>ipv_d, udx3m=>udx3m_d
  ...
  !$cuf kernel do(3)
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
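The directive also accepts an optional launch configuration in chevron syntax; omitting it (as above) or using `*` leaves the choice to the compiler. A brief illustration of the commonly documented forms (not from the talk):

!$cuf kernel do(3) <<< *, * >>>            ! compiler picks grid and block
!$cuf kernel do(2) <<< (*,*), (32,4) >>>   ! fixed 32x4 thread block, auto grid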

Page 14

Case 1: Tightly-nested loops

● CUF kernel directive automatically generates GPU kernels for tightly nested loops.

● Scalar data passed by value to device.

● Array data must already be resident on device.

(Same CUF kernel version of divg as shown on Page 13.)

Page 15

Case 1: Tightly-nested loops

● For getting data onto the device, CUDA Fortran allows for straightforward declaration/allocation of device data.

(CUF kernel version of divg repeated from Page 13.)

module local_arrays
  real(8), allocatable :: q1(:,:,:)
  real(8), device, allocatable :: q1_d(:,:,:)
  ...
end module local_arrays

allocate(q1(nx,ny,nz));   q1 = 0.d0
allocate(q1_d(nx,ny,nz)); q1_d = q1

Alternative using sourced allocation:
allocate(q1_d, source = q1)
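Copying results back to the host is equally terse: in CUDA Fortran, assignment between host and device arrays generates the transfers (a small sketch, not from the slides):

q1_d = q1     ! host-to-device copy
! ... kernels operate on q1_d ...
q1 = q1_d     ! device-to-host copy of the results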

Page 16

Additional CUF kernel features

● CUF kernels can be used to perform reductions of scalar device data.

● Final reduced result can be on the host or device.

subroutine calculate_volume_gpu(Volume, nv, nf, xyz, vert_of_face)
  integer, dimension(3,nf), device, intent(in) :: vert_of_face
  real(8), dimension(nv,3), device, intent(in) :: xyz
  real(8), intent(out) :: Volume
  ...
  Volume = 0.d0

  !$cuf kernel do (1)
  do i = 1,nf
    v1 = vert_of_face(1,i)
    v2 = vert_of_face(2,i)
    v3 = vert_of_face(3,i)

    x1 = xyz(v1,1); x2 = xyz(v2,1); x3 = xyz(v3,1)
    y1 = xyz(v1,2); y2 = xyz(v2,2); y3 = xyz(v3,2)
    z1 = xyz(v1,3); z2 = xyz(v2,3); z3 = xyz(v3,3)

    Volume = Volume + (x1 * (y2*z3 - z2*y3) + &
                       x2 * (y3*z1 - z3*y1) + &
                       x3 * (y1*z2 - z1*y2))/6.d0
  enddo

end subroutine calculate_volume_gpu

Page 17

Case 2: Common Mathematical Computations

● Beyond loop-based computations, many codes use common math computations for which there are GPU libraries readily available:

○ FFT: CUFFT
○ BLAS: CUBLAS
○ Linear Algebra: CUSOLVER

● Use wisely: Favor batched implementations when available, avoid many repeated calls of small operations

Page 18

Case 2: Common Mathematical Computations

Consider the original CPU code for performing a real-to-complex FFT using the FFTW library.

coefnorm = 1.d0/(dble(n1m) * dble(n2m))

do k = kstart,kend
  do j = 1,n2m
    do i = 1,n1m
      xr(j,i) = dph(i,j,k)
    enddo
  enddo

  call dfftw_execute_dft_r2c(fwd_plan, xr, xa)

  do j = 1,n2m/2 + 1
    do i = 1,n1m
      dpho(i,j,k)      = dreal(xa(j,i)) * coefnorm
      dpho(i,j+n2mh,k) = dimag(xa(j,i)) * coefnorm
    enddo
  enddo
end do

Page 19

Case 2: Common Mathematical Computations

Now consider the version for GPU using CUFFT library.

● Modified to use batched 2D FFTs

● Final loop merged with later packing loop ← kernel fusion

coefnorm = 1.d0/(dble(n1m) * dble(n2m))

!$cuf kernel do (3)
do k = kstart,kend
  do j = 1,n2m
    do i = 1,n1m
      xr_d(j,i,k) = dph_d(i,j,k)
    enddo
  enddo
enddo

istat = cufftExecD2Z(cufft_fwd_plan, xr_d, xa_d)

! Scaling/rearrangement combined with subsequent loop

Page 20

Case 2: Common Mathematical Computations

(Continuing the CUFFT version from Page 19.) The batched 2D plan used by cufftExecD2Z is created with cufftPlanMany:

integer :: cufft_fwd_plan
integer :: rank(2), inembed(2), onembed(2)

rank(1)    = n1m; rank(2)    = n2m
inembed(1) = n1m; inembed(2) = n2m
onembed(1) = n1m; onembed(2) = n2m/2 + 1

istat = cufftPlanMany(cufft_fwd_plan, 2, rank, inembed, 1, n1m*n2m, &
                      onembed, 1, n1m*(n2m/2 + 1),                  &
                      CUFFT_D2Z, kend-kstart+1)

Page 21

Interfaces for BLAS routines

● PGI provides overloaded interfaces for BLAS routines.

● Calls with device-resident arrays are automatically passed to the CUBLAS library.

use cudafor
use cublas

integer :: m, n, k
real(8) :: alpha, beta
real(8) :: a(m,k), b(k,n), c(m,n)
real(8), device :: a_d(m,k), b_d(k,n), c_d(m,n)

...

! DGEMM using linked CPU library
call DGEMM('N', 'N', m, n, k, alpha, a, m, b, k, &
           beta, c, m)

! DGEMM using CUBLAS
call DGEMM('N', 'N', m, n, k, alpha, a_d, m, b_d, k, &
           beta, c_d, m)

Page 22

Case 3: Unique computations

● The need for custom kernels arises in most programs:

○ Unique computations not amenable to a CUF kernel

○ Common mathematical operation, but no good GPU library implementation:

■ Tridiagonal LU factorization/solves with multiple RHS

○ Pattern of library usage that would be poor performing on GPU:

■ Data interpolation from flow grid to structural grid involves many small matrix and vector computations.

Page 23

Example 1: Batched Tridiagonal Solver

● Flow solver requires tridiagonal LU factorization/solves with multiple RHS

● Wrote batched tridiagonal solver using Thomas algorithm

● One GPU thread assigned per RHS

● To ensure coalesced access of RHS values by threads, data transposition required:

rhs_d(1:N1*N2, 1:NRHS) → rhs_t_d(1:NRHS, 1:N1*N2)
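A minimal sketch of what such a kernel might look like (not the authors' code; the transposed RHS layout follows the slide, while the names, the shared tridiagonal bands, and the per-thread workspace cp are assumptions). Each thread solves one RHS; consecutive threads read consecutive entries of rhs_t, giving coalesced access:

! a, b, c: tridiagonal bands of length n (shared by all systems in this sketch)
! rhs_t(nrhs, n): right-hand sides in transposed layout, solved in place
! cp(nrhs, n): scratch array for the modified upper-diagonal coefficients
attributes(global) subroutine thomas_batched(a, b, c, rhs_t, cp, n, nrhs)
  real(8) :: a(n), b(n), c(n), rhs_t(nrhs,n), cp(nrhs,n)
  integer, value :: n, nrhs
  integer :: j, k
  real(8) :: m
  j = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (j > nrhs) return
  ! Forward elimination (Thomas algorithm)
  cp(j,1)    = c(1) / b(1)
  rhs_t(j,1) = rhs_t(j,1) / b(1)
  do k = 2, n
     m          = b(k) - a(k) * cp(j,k-1)
     cp(j,k)    = c(k) / m
     rhs_t(j,k) = (rhs_t(j,k) - a(k) * rhs_t(j,k-1)) / m
  end do
  ! Back substitution
  do k = n-1, 1, -1
     rhs_t(j,k) = rhs_t(j,k) - cp(j,k) * rhs_t(j,k+1)
  end do
end subroutine thomas_batched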

Page 24

Example 2: Data Interpolation Between Grids

This is the most time-consuming operation in the IB portion of the solver.

The goal is to compute interpolated values on the structural grid from the flow grid.

Page 25

Example 2: Data Interpolation Between Grids

For a given triangle i:
● Form 27-point support domain around triangle centroid.
● Compute transfer function using support point and centroid data.

Final centroid result is scattered back to support points or to triangle vertices.

Page 26

Example 2: Data Interpolation Between Grids


Page 27

Example 2: Data Interpolation Between Grids

Computation of the transfer function for each triangle requires:
● 4 x 4 matrix inversion
● Several small matrix-vector multiplies: [1 x 4][4 x 4] and [1 x 4][4 x 27]

The final computation of the interpolated value is an inner product of 27 values.

Page 28

Example 2: Data Interpolation Between Grids

GPU strategy:
● Process each triangle using a warp (32-thread unit); map threads to support points.
● Data is warp-local → most matrix algebra can be completed efficiently using warp shuffle intrinsics.
● Scattering of final result completed using atomic adds (a sketch follows below).
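A hedged sketch of those two building blocks in CUDA Fortran device code (illustrative only; the real interpolation kernel is more involved). It assumes one 32-thread block per triangle, hypothetical array names, and that __shfl_xor and atomicadd are available for real(8) on the target device:

! vals(32, ntri): one weighted support-point value per lane and triangle
! dest(:):        destination array updated by the scatter
! idx(ntri):      destination index for each triangle (illustrative)
attributes(global) subroutine warp_reduce_scatter(vals, dest, idx, ntri)
  real(8) :: vals(32,*), dest(*)
  integer :: idx(*)
  integer, value :: ntri
  real(8) :: res, rold
  integer :: tri, lane
  tri  = blockIdx%x            ! one warp-sized block per triangle
  lane = threadIdx%x
  if (tri > ntri) return
  res = vals(lane, tri)
  ! Butterfly reduction with warp shuffles: every lane ends with the sum
  res = res + __shfl_xor(res, 16)
  res = res + __shfl_xor(res,  8)
  res = res + __shfl_xor(res,  4)
  res = res + __shfl_xor(res,  2)
  res = res + __shfl_xor(res,  1)
  ! One lane scatters the result; atomicadd avoids races between triangles
  if (lane == 1) rold = atomicadd(dest(idx(tri)), res)
end subroutine warp_reduce_scatter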

Page 29

Benchmarking and Results

Page 30

Verification Case

Page 31

Benchmarking Case

● Unit cube, quiescent flow

● N = 128, 256, 384

● # of Particles = 1, 8, 27, 64

● Particle Resolution = 1280, 5120, 20480 triangles

● Run on:
○ 1x 16-core Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
○ 1x NVIDIA Tesla V100 PCIe

Page 32

Grid Resolution

Fixed # of Particles = 8
Particle Resolution = 5120 triangles

Fluid:
● 10 to 14x speedup vs. CPU

IB + Structural:
● 40 to 100x speedup vs. CPU
● Percentage of time:
○ CPU: 72% to 14%
○ GPU: 20% to 6%

Page 33

Particle Resolution

Fixed N = 256
Fixed # of Particles = 8

IB + structural solver time increases at a reduced rate on GPU:
● CPU: 15% to 55%
● GPU: 6% to 13%

Page 34

Number of Particles

Fixed N = 256
Particle Resolution = 5120 triangles

IB + Structural solver time increases at similar rates:
● CPU: 14% to 59%
● GPU: 5% to 22%

Page 36

Conclusions

Page 37

Conclusions

● Porting research codes to GPUs is worth the investment
○ Faster runtimes enable larger cases, more rapid experimentation

● Large performance gains can be achieved with low effort using CUDA Fortran
○ CUF kernel directives
○ CUDA-enabled libraries
○ Custom kernels when all else fails

● Working with developers to apply current code to challenging research cases

● Some previous work with these developers can be found on GitHub: https://github.com/PhysicsofFluids/AFiD_GPU_opensource