HiFUN on GPU
Krishnababu et al.
Porting Scalable Parallel CFD Application HiFUN on NVIDIA GPU
D. V. Krishnababu, N. Munikrishna, Nikhil Vijay Shende (1)
N. Balakrishnan (2)
Thejaswi Rao (3)
1. S & I Engineering Solutions Pvt. Ltd., Bangalore, India
2. Aerospace Engineering, Indian Institute of Science, Bangalore, India
3. NVIDIA Graphics Pvt. Ltd., Bangalore, India
GPU Technology Conference, Silicon Valley
March 26–29, 2018
Introduction (http://www.sandi.co.in)
The HiFUN Software
- High Resolution Flow Solver on Unstructured Meshes.
- A Computational Fluid Dynamics (CFD) flow solver.
- Primary product of the company SandI.
- A robust, fast, accurate and efficient tool.
About SandI
- A technology company.
- Incubated from the Indian Institute of Science, Bangalore.
- Promotes high-end CFD technologies with uncompromising quality standards.
Features of HiFUN (http://www.sandi.co.in/home/products)
General
Features of HiFUN (http://www.sandi.co.in/home/products)
Well Validated
AIAA DPW SPICES
AIAA HiLiftPW
Features of HiFUN (http://www.sandi.co.in/home/products)
Super Scalable
Workload: 165 Million Volumes

Simulation   CPU Cores   Time (Hours / Days)
RANS         256         30 / 1.25
RANS         10000       1
URANS        256         108 / 4.5
URANS        10000       3
DES          256         525 / 22
DES          10000       15
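As a quick check on the table, the strong-scaling numbers can be reduced to speed-up and parallel efficiency. This minimal sketch reads actual speed-up as the ratio of wall-clock times and ideal speed-up as the ratio of core counts; the slides define these metrics per node, so applying them per core here is an assumption, and the tabulated times are rounded.

```python
# Wall-clock hours from the table: (hours on 256 cores, hours on 10000 cores).
TIMINGS = {
    "RANS": (30.0, 1.0),
    "URANS": (108.0, 3.0),
    "DES": (525.0, 15.0),
}
REF_CORES, RUN_CORES = 256, 10000


def scaling(t_ref: float, t_run: float) -> tuple[float, float, float]:
    """Return (ideal speed-up, actual speed-up, parallel efficiency).

    Assumption: the per-node definitions are applied per core.
    """
    ideal = RUN_CORES / REF_CORES   # 10000 / 256 = 39.0625
    actual = t_ref / t_run          # measured reduction in wall-clock time
    return ideal, actual, actual / ideal


for sim, (t_ref, t_run) in TIMINGS.items():
    ideal, actual, eff = scaling(t_ref, t_run)
    print(f"{sim:6s} ideal {ideal:5.2f}  actual {actual:5.2f}  efficiency {eff:.0%}")
```

With these figures the efficiency works out to roughly 77% for RANS, 92% for URANS and 90% for DES on the 39-fold increase in cores.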
SandI–NVIDIA Collaboration
2014 - Joint Development Initiative Kicks Off
2015 - NVIDIA Innovation Award
2016 - GTCx Mumbai; HiFUN in GPU Apps Catalogue; GTC 2016: Poster Presentation
2018 - GTC 2018
Way Ahead
- HiFUN on NVIDIA Pascal and Volta GPUs
- NVLink with IBM Power CPU
HiFUN on NVIDIA GPU
Hybrid Supercomputers
- Consist of CPUs and NVIDIA GPUs.
- Need less power to achieve the same FLOPS.
- Need less cooling and space.
GPU
- Thousands of computing cores sharing the same RAM.
- Higher memory bandwidth.
- High data transfer overheads with the CPU.
HiFUN on NVIDIA GPU
Parallelization Model on GPU
- Shared memory.
- Many FLOPS are needed per byte of data transferred from CPU to GPU.
- The parallelization of CFD algorithms has to be re-examined.
Parallelization Challenges
- General-purpose algorithms.
- Implicit schemes: global data dependence.
- Complex, multi-layered unstructured data structures.
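One concrete face of the unstructured-data challenge: a cell-centred residual is commonly assembled by looping over faces and scattering each face flux into the two cells that share the face, and on thousands of GPU threads two faces touching the same cell race on that update. A common remedy, shown here as a hedged sketch and not as HiFUN's documented approach, is to colour the faces so that no two faces of one colour share a cell; each colour is then a race-free parallel loop. The mesh and flux values are hypothetical, and boundary faces are ignored.

```python
from collections import defaultdict


def color_faces(faces: list[tuple[int, int]]) -> list[int]:
    """Greedy colouring: no two faces with the same colour share a cell."""
    used = defaultdict(set)  # cell -> colours already taken by its faces
    colors = []
    for left, right in faces:
        c = 0
        while c in used[left] or c in used[right]:
            c += 1
        used[left].add(c)
        used[right].add(c)
        colors.append(c)
    return colors


def accumulate_residual(n_cells: int, faces: list[tuple[int, int]],
                        flux: list[float], colors: list[int]) -> list[float]:
    """Scatter face fluxes colour by colour (each colour is race-free)."""
    residual = [0.0] * n_cells
    for c in range(max(colors) + 1):
        # On a GPU, the faces of one colour are what would run in parallel.
        for f, (left, right) in enumerate(faces):
            if colors[f] == c:
                residual[left] -= flux[f]   # flux leaves the left cell
                residual[right] += flux[f]  # and enters the right cell
    return residual


# Hypothetical 1-D chain of 4 cells joined by 3 interior faces.
faces = [(0, 1), (1, 2), (2, 3)]
colors = color_faces(faces)
print(colors)  # [0, 1, 0]
print(accumulate_residual(4, faces, [1.0, 2.0, 3.0], colors))  # [-1.0, -1.0, -1.0, 3.0]
```

Greedy colouring does not minimise the number of colours, but it is cheap and is done once per mesh, so its cost is amortised over all solver iterations.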
HiFUN on NVIDIA GPU
Constraints
- No compromise on distributed-memory scalability.
- Source code maintainability should not suffer.
- Software portability should not suffer.
Parallel Strategy
- Accelerate single-node performance via an offload model.
- Hybrid: MPI and OpenACC directives.
Offload Model
- The computationally intensive part is offloaded to the GPU.
- Data communication between CPU and GPU is optimized.
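Why the offload model puts so much weight on CPU–GPU communication can be seen from a back-of-the-envelope cost model: if the solution arrays cross the host-device link on every iteration, transfers can swamp the GPU's compute advantage, whereas keeping them resident on the GPU for the whole run (what an enclosing OpenACC `data` region is typically used for) pays that cost only once. All sizes and rates below are hypothetical placeholders, not HiFUN measurements.

```python
def run_time(iters, state_bytes, flops_per_iter, link_bw, gpu_flops, resident):
    """Estimated wall time (s) for an offloaded iterative solver.

    resident=True : state copied to the GPU once and back once.
    resident=False: state copied in and out on every iteration.
    All inputs are hypothetical.
    """
    copies = 2 if resident else 2 * iters
    transfer = copies * state_bytes / link_bw
    compute = iters * flops_per_iter / gpu_flops
    return transfer + compute


# Hypothetical workload: 1 GB of solution data, 1e10 FLOPs per iteration.
params = dict(iters=1000, state_bytes=1e9, flops_per_iter=1e10,
              link_bw=1e10,     # ~10 GB/s host-device link (assumed)
              gpu_flops=1e12)   # ~1 TFLOP/s sustained (assumed)
print(run_time(resident=False, **params))  # about 210 s, mostly transfers
print(run_time(resident=True, **params))   # about 10.2 s
```

Under these assumptions, the same GPU kernels run roughly twenty times faster overall simply because the data stays on the device between iterations.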
HiFUN on NVIDIA GPU
(Figures: Onera M6, NASA CRM, Trap Wing)
Configurations & Workloads (Million)
- Onera M6 Wing: 1.1, 9.3, 12.12, 15.4
- NASA CRM: 6.2, 26.5, 30
- NASA Trap Wing: 20, 66
Simulation Type
- Steady RANS simulations
HiFUN on NVIDIA GPU
Computing Platform: NVIDIA PSG
Node configuration
- Two hexa-deca-core (16-core) Intel(R) Xeon(R) Haswell processors.
- Eight NVIDIA Tesla K80 GPUs.
- GPU memory = 12 GB.
- Total CPU memory per node = 256 GB.
- InfiniBand interconnect.
Software
- PGI Compiler 16.7
- Open MPI 1.10.2
- OpenACC 2.0
HiFUN on NVIDIA GPU: Parallel Performance Parameters
Ideal Speed-up
- Ratio of the number of nodes used for a given run to the reference number of nodes.
Actual Speed-up
- Ratio of the time per iteration using the reference number of nodes to the time per iteration using the number of nodes for the given run.
Accelerator Speed-up
- Ratio of the time per iteration obtained using a given number of CPUs to the time per iteration obtained using the same number of CPUs working in tandem with GPUs.
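The three definitions can be written compactly as follows. The symbols are ours, not the slides': N is the number of nodes in a given run, N_ref the reference number of nodes, t(N) the wall-clock time per iteration on N nodes, and t_CPU and t_CPU+GPU the time per iteration on the same number of CPUs without and with GPUs.

```latex
\begin{aligned}
S_{\text{ideal}}  &= \frac{N}{N_{\text{ref}}}, \\
S_{\text{actual}} &= \frac{t(N_{\text{ref}})}{t(N)}, \\
S_{\text{acc}}    &= \frac{t_{\text{CPU}}}{t_{\text{CPU+GPU}}}, \\
E                 &= \frac{S_{\text{actual}}}{S_{\text{ideal}}} .
\end{aligned}
```

The last line is the usual parallel efficiency (not defined on the slides); E = 1 corresponds to linear speed-up.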
HiFUN on NVIDIA GPU: Single Node Performance
(Figure: Accelerator speed-up on 2 GPUs)
Observations
- Increasing the grid size increases GPU utilization and the accelerator speed-up.
- It is important to load the GPU completely.
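The trend that larger grids give a higher accelerator speed-up is what a simple fixed-overhead model predicts: if each GPU-assisted iteration carries a roughly constant overhead on top of a per-volume cost, that overhead is amortized better as the grid grows, and the speed-up approaches the ratio of per-volume costs. The sketch uses made-up constants purely to show the shape of the curve; it is not fitted to the HiFUN data.

```python
def accelerator_speedup(n_volumes, cpu_cost, gpu_cost, gpu_overhead):
    """CPU-only time per iteration divided by CPU+GPU time per iteration.

    cpu_cost, gpu_cost: seconds per volume per iteration (hypothetical).
    gpu_overhead: fixed seconds per iteration on the GPU path (hypothetical).
    """
    t_cpu = cpu_cost * n_volumes
    t_gpu = gpu_overhead + gpu_cost * n_volumes
    return t_cpu / t_gpu


for millions in (1, 5, 15, 30):
    s = accelerator_speedup(millions * 1e6, cpu_cost=1e-6, gpu_cost=1e-7,
                            gpu_overhead=0.5)
    print(f"{millions:>3} M volumes: speed-up {s:.2f}")
```

With these constants the speed-up climbs from about 1.7 at 1 million volumes to about 8.6 at 30 million, approaching the ceiling cpu_cost / gpu_cost = 10; an under-loaded GPU spends proportionally more of each iteration on the fixed overhead.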
HiFUN on NVIDIA GPU: Single Node Performance
(Figure: Varying number of GPUs, % increase)
Observations
- Increasing the number of GPUs increases the accelerator speed-up.
- Using 4 GPUs per node is optimal.
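One way to see why each additional GPU on a node buys less is Amdahl's law applied to the offload model: only the offloaded share of the work is divided among the GPUs, while the rest stays on the CPU. The offload fraction and per-GPU speed factor below are illustrative assumptions that the slides do not report; measured behaviour also depends on GPU utilization.

```python
def node_speedup(gpus, offload_fraction, gpu_factor):
    """Amdahl-style speed-up of one node with `gpus` GPUs over CPU-only.

    offload_fraction: share of CPU-only time spent in offloaded kernels (assumed).
    gpu_factor: how much faster one GPU runs those kernels than the CPUs (assumed).
    """
    serial = 1.0 - offload_fraction                 # stays on the CPU
    offloaded = offload_fraction / (gpu_factor * gpus)
    return 1.0 / (serial + offloaded)


prev = None
for g in (1, 2, 4, 8):
    s = node_speedup(g, offload_fraction=0.95, gpu_factor=4.0)
    gain = "" if prev is None else f" (+{100 * (s / prev - 1):.0f}%)"
    print(f"{g} GPUs: speed-up {s:.2f}{gain}")
    prev = s
```

With these numbers each doubling of the GPU count yields roughly +70%, +54% and +37%, and the speed-up can never exceed 1 / (1 - offload_fraction) = 20 however many GPUs are added.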
HiFUN on NVIDIA GPU: Single Node Performance
(Figure: Time to RANS solution, hours)
Observations
- Time to solution on a 1 million volume grid ∼ 15 minutes.
- Time to solution on a 30 million volume grid ∼ half a day.
- A single node serves as a desktop supercomputer.
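Normalizing the two quoted times by grid size shows that time to solution grows faster than the number of volumes. The arithmetic below takes "half a day" as 12 hours, which is our reading of an approximate figure; the slides do not split the times into time per iteration and iteration count.

```python
# Approximate single-node times to a steady RANS solution, from the slide.
cases = {1.0: 15.0, 30.0: 12 * 60.0}  # million volumes -> minutes (12 h assumed)

for mvol, minutes in cases.items():
    print(f"{mvol:>4.0f} M volumes: {minutes:6.0f} min, "
          f"{minutes / mvol:5.1f} min per million volumes")

growth = cases[30.0] / cases[1.0]
print(f"grid x30 -> time x{growth:.0f}")
```

So a 30-fold larger grid takes roughly 48 times longer, about 24 minutes per million volumes against 15 for the small case.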
HiFUN on NVIDIA GPU: Multi-node Performance
(Figure: Parallel speed-up, 66 million workload)
Observations
- Near-linear speed-up using 2 GPUs per node.
- Speed-up drops for larger numbers of nodes and/or more GPUs per node, due to lower GPU utilization.
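The drop in speed-up ties back to the single-node observation that GPUs must be fully loaded: the fixed 66 million volume Trap Wing workload is spread over more GPUs, so each receives less work. The tabulation below assumes, for illustration, an even split of volumes across GPUs; the node counts 2, 4 and 8 are hypothetical examples.

```python
WORKLOAD = 66e6  # volumes in the 66 million Trap Wing case


def volumes_per_gpu(nodes, gpus_per_node):
    """Volumes per GPU, assuming an even partition (hypothetical)."""
    return WORKLOAD / (nodes * gpus_per_node)


for gpn in (2, 4, 8):
    row = [volumes_per_gpu(n, gpn) / 1e6 for n in (2, 4, 8)]
    print(f"{gpn} GPUs/node:", ", ".join(f"{m:.2f} M" for m in row))
```

At 8 nodes, moving from 2 to 8 GPUs per node cuts the load per GPU from about 4.1 million to about 1 million volumes, the small-grid regime where the single-node results show lower utilization.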
HiFUN on NVIDIA GPU: Multi-node Performance
(Figure: Normalized time per iteration, 66 million workload)
Observations
- Time per iteration drops as the number of nodes and/or GPUs increases.
- Time to solution with 8 nodes ∼ 4 hours.
HiFUN on NVIDIA GPU
Concluding Remarks
- An offload model was used to port HiFUN to the GPU.
- A GPU-based computing node is powerful enough to serve as a desktop supercomputer.
- HiFUN is ideally suited to solving grand-challenge problems on GPU-based hybrid supercomputers.
- An OpenACC directive-based offload model is an attractive option for porting legacy CFD codes to the GPU.