large scale computational luid dynamics simulations...
TRANSCRIPT
LARGE SCALE COMPUTATIONAL FLUID DYNAMICS SIMULATIONS ON HYBRID SUPERCOMPUTERS
Eric Kelmelis
EM Photonics
May 16, 2012
Copyright © 2012 EM Photonics – EM Photonics Proprietary
MODERNIZATION OF CFD SOLVERS
» NAVAIR uses CFD solvers to model ships and their interaction with moving planes
» Full ships require >500,000 CPU hours on a supercomputer
» More performance means faster turnaround times for simulations and more complex models
2
Case Study » CFD solvers are used throughout the DoD, NASA, and elsewhere » New solvers are being
developed
» Real world CFD simulations require massive computational power
» HPC hardware technology has recently evolved much faster than software implementations » Current solvers do not
scale to new hardware
Copyright © 2012 EM Photonics – EM Photonics Proprietary
PROJECT OVERVIEW
» We are experts in high-performance computing and accelerated processing
» Experience in scientific computing in general and CFD specifically
» Project Goal: Improve performance of CFD solvers on massive HPC systems
Models Meshes Solvers Visuals
Copyright © 2012 EM Photonics – EM Photonics Proprietary
CFD SOLVERS ADDRESSED
4
Created by: AFRL
» 50k lines of Fortran
» Can be coupled to external codes
Used by: Primarily DoD
AVUS
Created by: NASA Langley
» 765k lines of Fortran
» Multiphysics
Used by: NASA, commercial, some DoD
FUN3D
» Unstructured grid solvers » “MPI Everywhere” structure » Desire to scale to massive problems (10,000+ cores) » Include many legacy manual optimizations
Copyright © 2012 EM Photonics – EM Photonics Proprietary
CFD SOLVERS AT NAVAIR
» NAVAIR uses:
» FUN3D
» AVUS
» USM3D and Cobalt also popular
» Several other solvers and tools used too
Copyright © 2012 EM Photonics – EM Photonics Proprietary
FULL SYSTEM HARDWARE HYBRIDIZATION
» The three major chip makers are all following the heterogeneous “accelerator model”
» Develop solutions that use entire system (multicore + accelerators) and scale as new hardware is added
» Communication is more complex and critical to performance and scaling
NVIDIA Intel AMD
Heavily backing their mature CUDA technologies
Developing “Knights Corner” in addition to multicore
processors
Supports the OpenCL standard for their CPUs &
graphics cards
Intel MIC
6
Copyright © 2012 EM Photonics – EM Photonics Proprietary
MODERN HIGH-PERFORMANCE COMPUTING NODE
7
Node
CPU CPU
GPU GPU
Node
CPU
High-performance computing nodes have evolved
Copyright © 2012 EM Photonics – EM Photonics Proprietary
PREVIOUS GENERATION HPC ARCHITECTURE
8
Node
CPU
Node
CPU
MP
I
MPI
MPI
Node
CPU
Node
CPU M
PI
MPI
MPI
Node
CPU
Node
CPU
MP
I
MPI
MPI
Node
CPU
Node
CPU
MP
I
Nodes contained single-core CPUs and were connected by MPI
Copyright © 2012 EM Photonics – EM Photonics Proprietary
Node Node
MPI-BASED PARALLELIZATION
9
CPU CPU
Node
CPU
Node
CPU
Job
Results
MPI
MPI
MP
I
MP
I
» Jobs equally divided between identical nodes
» Larger problems can be solved
» Processing time is reduced
MPI is designed for inter-node
communication
Copyright © 2012 EM Photonics – EM Photonics Proprietary
CURRENT SOFTWARE ON MODERN HPC NODES
10
Node
CPU
GPU
MPI
MPI
MP
I
MP
I
CPU
MPI
MPI
MP
I
MP
I
GPU
» MPI software can be run on multicore chips without modification so it often is
» MPI software cannot take advantage of vector processors (GPUs)
What is needed » Extend software to take
advantage of GPUs » Optimize intra-chip
communication of multicore chips
» MPI not the most efficient way for scaling
» Continue to use MPI for inter-node communication
MPI MPI
MP
I M
PI
Copyright © 2012 EM Photonics – EM Photonics Proprietary
PROFILING FUN3D
» Performed profiling using NAVAIR inputs
» Observed similar results to NASA
Copyright © 2012 EM Photonics – EM Photonics Proprietary
PROFILING AVUS
» Air Force CFD solver » AFRL/VAAC
» Profiling shows over 60% of compute time is in one routine » Gauss-Seidel
solver
» Similar to NASA FUN3D’s profile
12
Copyright © 2012 EM Photonics – EM Photonics Proprietary
INITIAL PROTOTYPING
13
» Prototyped most computationally intense routine on GPU
» Work distribution became dramatically unbalanced – no overall improvement in speed
Accelerating Bottlenecks
» Initial prototypes showed the need to escape intra-chip MPI parallelism and static work distribution
» Prototyped main computational region
» 16x speedup that could not be realized
» Completely update AVUS for OpenMP + MPI
A New Approach
Accelerated
Accelerated
CPU
CPU
MP
I
Single Mean Flow Sweep
Wait
Wait
Copyright © 2012 EM Photonics – EM Photonics Proprietary
SOFTWARE EVOLUTION ROADMAP
» First step is extend the underlying framework to handle work queues
» CPU optimization and GPU programming can happen concurrently
14
Current Version of AVUS
Step 1 Add Work Queue
Capability to AVUS
Step 2 Optimize CPU Version
with OpenMP
Step 3 Extend AVUS to Work
on GPUs
Copyright © 2012 EM Photonics – EM Photonics Proprietary
CPU
CPU
CPU
CPU
CPU
CPU
CPU
CPU
SOFTWARE EVOLUTION STEP 1: WORK QUEUE
15
Job
Job is broken up into pieces
and distributed to processors in
the system
Job
Create a queue of work
that can be dynamically
distributed to processors
Original AVUS Modified Version
Copyright © 2012 EM Photonics – EM Photonics Proprietary
ADVANTAGES OF THE WORK QUEUE
16
» Slowest worker sets whole system pace
» Slower CPU
» Larger work unit
» Problem partitioning is static
» Rebalancing can be added later but not as efficient
Current Model
» Different processing rates supported
» Increase number of work packets – workers receive different amounts
» Easily supports dynamic reallocation
» Allows for mixed processor speeds
Work Queue Approach
Copyright © 2012 EM Photonics – EM Photonics Proprietary
SOFTWARE EVOLUTION STEP 2: CPU OPTIMIZATION
17
Original AVUS Modified Version
Node
CPU
MPI
MPI
MP
I
MP
I
CPU
MPI
MPI
MP
I
MP
I
MPI
Memory
Node
CPU CPU
Memory
Each core has its own memory that cannot be shared. Communication with MPI.
Transition to OpenMP and shared memory to improve intra-node efficiency
Copyright © 2012 EM Photonics – EM Photonics Proprietary
ADVANTAGES OF USING ONE PROCESS
» Less communication overhead
» More efficient data transfers through shared memory
» Reduced memory usage
» Allows partitioning work to GPU while still using the CPU
» OpenMP chosen – minimal disruption to existing code
18
Copyright © 2012 EM Photonics – EM Photonics Proprietary
SOFTWARE EVOLUTION STEP 3: ADDING GPUS
» GPU processes portions of the work for critical routines » Amount of work taken
based on its relative performance
» GPU code impact is small » Could change
accelerators easily
» OpenMP allows for flexibility
19
Node
Process
MPI
OpenMP
Copyright © 2012 EM Photonics – EM Photonics Proprietary
ADVANTAGES OF ADDING GPUS
» GPUs are moving from novelty to mainstream in current generation HPCs
» GPUs have much more computational power and are more computationally efficient than CPUs
» Next-generation hardware from all major vendors will incorporate GPU-like vector processing
20
NV47 G80
G90 Tesla
Fermi
Core2 Duo Core2 Duo Core2 Quad Core i7
Core i7
0
500
1000
1500
2005 2006 2007 2008 2009 2010 2011
Pe
ak G
FLO
PS
GPU Performance Trends
NVIDIA GPU
Intel CPU
Copyright © 2012 EM Photonics – EM Photonics Proprietary
PERFORMANCE
GPU Results » A 16x and 13x performance
increase were realized in two important code sections
» This leads to a 35% reduction in overall runtime
» Further tuning is ongoing
21
0
0.5
1
1.5
2
2.5
Kernel 1 Kernel 2
CPU Time
GPU Time
0
20
40
60
80
100
120
59200 cells
Original Time
EMP Time
Overall Runtime Runtime of prototype segments
CPU Results » Memory usage decreased by 20%
» More drastic as number of cores increases (4 core)
» Flexible architecture allows for runtime performance adaptation
» Speed gains will be realized over current MPI-only model
Copyright © 2012 EM Photonics – EM Photonics Proprietary
SOME PICTURES
Copyright © 2012 EM Photonics – EM Photonics Proprietary
BENEFITS OF OUR APPROACH
23
LOWER MEMORY USAGE
Shared data is truly shared
On CPU and GPU
Supports accelerator paradigms
Respond to conditions on-the-fly
HIGHER PERFORMANCE
FLEXIBLE ARCHITECTURE
DYNAMIC BALANCING
Optimizing communication allows scaling to more nodes
MORE SCALABLE
Copyright © 2012 EM Photonics – EM Photonics Proprietary
NEXT STEPS
» Remove intra-chip MPI
» Direct memory transfers
» Implement dynamic load balancing
» Move zones between
threads and nodes
» Move additional code segments to the GPU
» Only one subroutine thus far
» Transition into Kestrel
24
Copyright © 2012 EM Photonics – EM Photonics Proprietary
CONCLUSION
» HPC model has changed » Many CPU cores need to be optimally managed within a
node and GPU technology must be leveraged » Vital to CFD
» Current CFD solvers are unable to scale up to the number of cores available today
» Applicable Elsewhere » Computationally intense software in other fields can
benefit » We are always looking for new problems to tackle
Questions?
25
Copyright © 2012 EM Photonics – EM Photonics Proprietary
BIG PICTURE
26
Node
Process
Node
Process
MPI
OpenMP OpenMP
Copyright © 2012 EM Photonics – EM Photonics Proprietary
COMPUTING TECHNOLOGY TRENDS
Parallel Processing
To achieve maximum performance, an algorithm’s parallelism must be exploited
Parallelism is now available within a single chip, shifting the programming paradigm
Systems are being designed to use different types of processing devices at the same time
Heterogeneous paradigm now being scaled to large HPC systems
Multicore
Heterogeneous Computing Hybrid Supercomputing
Copyright © 2012 EM Photonics – EM Photonics Proprietary
NASA’S FUN3D
» Important CFD code for unstructured grids
» Capable of analyzing a broad range of CFD physics
» Used in commercial, defense, and research
» EM Photonics is working closely with NASA engineers