
HiFUN on GPU

Krishnababu et al.

Porting Scalable Parallel CFD Application HiFUN on NVIDIA GPU

D. V. Krishnababu, N. Munikrishna, Nikhil Vijay Shende (1)

N. Balakrishnan (2)

Thejaswi Rao (3)

1. S & I Engineering Solutions Pvt. Ltd., Bangalore, India

2. Aerospace Engineering, Indian Institute of Science, Bangalore, India

3. NVIDIA Graphics Pvt. Ltd., Bangalore, India

GPU Technology Conference, Silicon Valley

March 26–29, 2018


Introduction
http://www.sandi.co.in

The HiFUN Software
High Resolution Flow Solver on Unstructured Meshes.
A Computational Fluid Dynamics (CFD) Flow Solver.
Primary product of the company SandI.
Robust, fast, accurate and efficient tool.

About SandI
A technology company.
Incubated from Indian Institute of Science, Bangalore.
Promotes high end CFD technologies with uncompromising quality standards.


Features of HiFUN
http://www.sandi.co.in/home/products

General


Well Validated

AIAA DPW SPICES

AIAA HiLiftPW


Super Scalable

Workload: 165 Million Volumes

Simulation   CPU Cores   Time (Hours/Days)
RANS         256         30/1.25
RANS         10000       1
URANS        256         108/4.5
URANS        10000       3
DES          256         525/22
DES          10000       15
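As a rough reading aid for the scalability table, the sketch below compares the ratio of run times with the ratio of core counts for each simulation type. It assumes that the single values in the 10000-core rows are hours, which the slide does not state explicitly, and the reported times are rounded, so the ratios are only indicative. The function name `time_ratio` is illustrative and not part of HiFUN.

```python
# (simulation, hours on 256 cores, hours on 10000 cores), taken from the table.
# Assumption: the 10000-core entries are in hours.
runs = [("RANS", 30.0, 1.0), ("URANS", 108.0, 3.0), ("DES", 525.0, 15.0)]

# Ratio of core counts between the two runs.
core_ratio = 10000 / 256


def time_ratio(hours_small, hours_large):
    """Ratio of run time on the smaller core count to that on the larger one."""
    return hours_small / hours_large


for name, h256, h10k in runs:
    ratio = time_ratio(h256, h10k)
    print(f"{name}: time ratio {ratio:.1f} vs core ratio {core_ratio:.1f}")
```

A time ratio close to the core ratio indicates that the additional cores are being used effectively.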


SandI–NVIDIA Collaboration

2014 - Joint Development Initiative Kicks Off

2015 - NVIDIA Innovation Award

2016 -
GTCx Mumbai
HiFUN in GPU Apps Catalogue
GTC 2016: Poster Presentation

2018 - GTC 2018

Way Ahead -
HiFUN on NVIDIA Pascal, Volta GPU
NVLink With IBM Power CPU


HiFUN on NVIDIA GPU

Hybrid Supercomputers
Consist of CPU and NVIDIA GPU.
Less power to achieve same FLOPS.
Less cooling & space.

GPU
Thousands of computing cores sharing same RAM.
Higher memory bandwidth.
High data transfer overheads with CPU.


Parallelization Model on GPU
Shared memory.
Many FLOPS per byte of data from CPU to GPU.
Re-look at parallelization of CFD algorithms.
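The requirement of many FLOPS per byte transferred can be illustrated with a simple cost model. The sketch below is not taken from HiFUN: it assumes that transfer and compute do not overlap and that data is moved once per unit of work, and all rates are hypothetical placeholders rather than measured figures.

```python
def offload_pays_off(flops, bytes_moved, cpu_rate, gpu_rate, link_rate):
    """Return True if GPU compute plus CPU-GPU transfer beats CPU-only compute.

    Rates are in FLOP/s (cpu_rate, gpu_rate) and bytes/s (link_rate).
    """
    t_cpu = flops / cpu_rate
    t_gpu = flops / gpu_rate + bytes_moved / link_rate
    return t_gpu < t_cpu


def break_even_intensity(cpu_rate, gpu_rate, link_rate):
    """FLOPs per transferred byte above which offloading wins in this model."""
    if gpu_rate <= cpu_rate:
        raise ValueError("the GPU must be faster than the CPU for offload to pay off")
    return 1.0 / (link_rate * (1.0 / cpu_rate - 1.0 / gpu_rate))


# Hypothetical rates: 100 GFLOP/s CPU, 1 TFLOP/s GPU, 10 GB/s CPU-GPU link.
CPU, GPU, LINK = 100e9, 1000e9, 10e9
print(f"break-even: {break_even_intensity(CPU, GPU, LINK):.1f} FLOPs per byte")
print(offload_pays_off(20e9, 1e9, CPU, GPU, LINK))  # 20 FLOPs per byte
print(offload_pays_off(5e9, 1e9, CPU, GPU, LINK))   # 5 FLOPs per byte
```

In practice, keeping data resident on the GPU across iterations raises the effective FLOPS per byte, which is one reason to minimize CPU-GPU data movement.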

Parallelization Challenges
General purpose algorithms.
Implicit: Global data dependence.
Complex multi-layered unstructured data structure.


Constraints
No compromise on distributed memory scalability.
Source code maintainability should not suffer.
Software portability should not suffer.

Parallel Strategy
Accelerate single node performance via offload model.
Hybrid: MPI and OpenACC directives.

Offload Model
Computationally intensive part is offloaded to GPU.
Optimal data communication between CPU & GPU.


Onera M6, NASA CRM, Trap Wing

Configurations & Workloads (Million)
Onera M6 Wing: 1.1, 9.3, 12.12, 15.4
NASA CRM: 6.2, 26.5, 30
NASA Trap Wing: 20, 66

Simulation Type
Steady RANS Simulations


Computing Platform: NVIDIA PSG

Node configuration
Two hexadeca-core (16-core) Intel(R) Xeon(R) Haswell processors.
Eight NVIDIA Tesla K80 GPUs.
GPU Memory = 12 GB.
Total CPU Memory per node = 256 GB.
InfiniBand interconnect.

Software
PGI Compiler 16.7
Open MPI 1.10.2
OpenACC 2.0


Parallel Performance Parameters

Ideal Speed-up
Ratio of the number of nodes used for a given run to the reference number of nodes.

Actual Speed-up
Ratio of time per iteration using the reference number of nodes to time per iteration using the number of nodes for the given run.

Accelerator Speed-up
Ratio of time per iteration obtained using a given number of CPUs to time per iteration obtained using the same number of CPUs working in tandem with GPUs.
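The three performance parameters can be written down directly as ratios. The sketch below is a minimal illustration; the function names and the timings are hypothetical placeholders and are not measured HiFUN data.

```python
def ideal_speedup(nodes, ref_nodes):
    """Number of nodes used for the run divided by the reference number of nodes."""
    return nodes / ref_nodes


def actual_speedup(time_per_iter_ref, time_per_iter_run):
    """Time per iteration on the reference nodes divided by that of the given run."""
    return time_per_iter_ref / time_per_iter_run


def accelerator_speedup(time_per_iter_cpu, time_per_iter_cpu_gpu):
    """CPU-only time per iteration divided by CPU+GPU time per iteration,
    both measured with the same number of CPUs."""
    return time_per_iter_cpu / time_per_iter_cpu_gpu


# Hypothetical timings in seconds per iteration (illustration only).
ideal = ideal_speedup(nodes=8, ref_nodes=2)
actual = actual_speedup(time_per_iter_ref=12.0, time_per_iter_run=3.75)
accel = accelerator_speedup(time_per_iter_cpu=6.0, time_per_iter_cpu_gpu=4.0)
print(ideal, actual, accel)
```

An actual speed-up close to the ideal speed-up indicates near linear scaling; an accelerator speed-up above 1 means the GPUs reduce the time per iteration.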


Single Node Performance

Accelerator Speed-up on 2 GPUs

Observations
An increase in grid size increases GPU utilization and accelerator speed-up.
Important to load the GPU completely.


Varying GPUs; % Increase

Observations
An increase in the number of GPUs increases accelerator speed-up.
Use of 4 GPUs per node is optimal.


Time to RANS Solution (Hours)

Observations
Time to solution on 1 million grid ∼ 15 minutes.
Time to solution on 30 million grid ∼ half a day.
Single node serves as a desktop supercomputer.


Multi-node Performance

Parallel Speed-up: 66 Million Workload

Observations
Near linear speed-up using 2 GPUs per node.
Drop in speed-up for a larger number of nodes and/or more GPUs per node, due to lower GPU utilization.


Normalized Time Per Iteration: 66 Million Workload

Observations
Drop in time per iteration with an increase in the number of nodes and/or GPUs.
Time to solution with 8 nodes ∼ 4 hours.


Concluding Remarks

An offload model is used to port HiFUN to GPU.
A GPU based computing node is powerful enough to serve as a desktop supercomputer.
HiFUN is ideally suited to solve grand challenge problems on GPU based hybrid supercomputers.
An OpenACC directives based offload model is an attractive option for porting legacy CFD codes to GPU.
